Abstract

Inhomogeneous random graph models encompass many network models such as stochastic block 
models and latent position models. We consider the problem of statistical estimation of the matrix of 
connection probabilities based on the observations of the adjacency matrix of the network. Taking the 
stochastic block model as an approximation, we construct estimators of network connection probabilities:
the ordinary block constant least squares estimator and its restricted version. We show that they
satisfy oracle inequalities with respect to the block constant oracle. As a consequence, we derive optimal 
rates of estimation of the probability matrix. Our results cover the important setting of sparse networks. 
Another consequence consists in establishing upper bounds on the minimax risks for graphon estimation 
in the L2 norm when the probability matrix is sampled according to a graphon model. These bounds 
include an additional term accounting for the “agnostic” error induced by the variability of the latent 
unobserved variables of the graphon model. In this setting, the optimal rates are influenced not only 
by the bias and variance components as in usual nonparametric problems but also include the third 
component, which is the agnostic error. The results shed light on the differences between estimation 
under the empirical loss (the probability matrix estimation) and under the integrated loss (the graphon 
estimation). 


1 Introduction 

Consider a network defined as an undirected graph with $n$ nodes. Assume that we observe the values
$A_{ij} \in \{0,1\}$ where $A_{ij} = 1$ is interpreted as the fact that the nodes $i$ and $j$ are connected and $A_{ij} = 0$
otherwise. We set $A_{ii} = 0$ for all $1 \le i \le n$ and we assume that $A_{ij}$ is a Bernoulli random variable with
parameter $(\Theta_0)_{ij} = \mathbb{P}(A_{ij} = 1)$ for $1 \le j < i \le n$. The random variables $A_{ij}$, $1 \le j < i \le n$, are assumed
independent. We denote by $A$ the adjacency matrix, i.e., the $n \times n$ symmetric matrix with entries $A_{ij}$ for
$1 \le j < i \le n$ and zero diagonal entries. Similarly, we denote by $\Theta_0$ the $n \times n$ symmetric matrix with entries
$(\Theta_0)_{ij}$ for $1 \le j < i \le n$ and zero diagonal entries. This is a matrix of probabilities associated to the graph;
the nodes $i$ and $j$ are connected with probability $(\Theta_0)_{ij}$. The model with such observations $A' = (A_{ij},
1 \le j < i \le n)$ is a special case of the inhomogeneous random graph model that we will call, for definiteness, the
network sequence model¹, to emphasize a parallel with the Gaussian sequence model.
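As an illustration of the network sequence model, the following minimal sketch draws one adjacency matrix from a given probability matrix. The function name `sample_adjacency` and the `seed` argument are assumptions made for this example; they do not come from the paper.

```python
import random


def sample_adjacency(theta, seed=None):
    """Draw a symmetric 0/1 adjacency matrix with zero diagonal.

    theta[i][j] is the connection probability of nodes i and j. Only the
    entries with j < i are read, mirroring the indexing 1 <= j < i <= n.
    The name and signature are illustrative assumptions.
    """
    rng = random.Random(seed)
    n = len(theta)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i):
            # Independent Bernoulli draw with parameter theta[i][j].
            edge = 1 if rng.random() < theta[i][j] else 0
            adj[i][j] = edge
            adj[j][i] = edge  # undirected graph: mirror the entry
    return adj
```

Each pair $j < i$ is visited once, so the $n(n-1)/2$ draws are independent, as the model requires.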

1.1 Graphons and sparse graphon models 

Networks arise in many areas such as information technology, social life, genetics. These real-life networks 
are in permanent movement and often their size is growing. Therefore, it is natural to look for a well-defined 
“limiting object” independent of the network size $n$ and such that a stochastic network can be viewed as a
partial observation of this limiting object. Such objects, called graphons, play a central role in the recent
theory of graph limits introduced by Lovász and Szegedy [11]. For a detailed description of this theory we
refer to the monograph by Lovász [12]. Graphons are symmetric measurable functions $W : [0,1]^2 \to [0,1]$.
In the sequel, the space of graphons is denoted by $\mathcal{W}$. The main message is that every graph limit can be
represented by a graphon. This beautiful theory of graph limits was developed for dense graphs, that is,
graphs with the number of edges comparable to the square of the number of vertices.

¹ In some recent papers, it is also called the inhomogeneous Erdős-Rényi model, which is somewhat ambiguous since the
words “Erdős-Rényi model” designate a homogeneous graph.

Graphons give a natural way of generating random graphs [12, 9]. The probability that two distinct 
nodes i and j are connected in the graphon model is the random variable 

$$(\Theta_0)_{ij} = W_0(\xi_i, \xi_j) \tag{1}$$

where $\xi_1, \dots, \xi_n$ are unobserved (latent) i.i.d. random variables uniformly distributed on $[0,1]$. As above, the
diagonal entries of $\Theta_0$ are zero. Conditionally on $\xi = (\xi_1, \dots, \xi_n)$, the observations $A_{ij}$ for $1 \le j < i \le n$ are
assumed to be independent Bernoulli random variables with success probabilities $(\Theta_0)_{ij}$. For any positive
integer $n$, a graphon function $W_0$ defines a probability distribution on graphs of size $n$. Note that this model
is different from the network sequence model since the observations $A_{ij}$ are no longer independent. If $W_0$
is a step function with $k$ steps, we obtain an analog of the stochastic block model with $k$ groups. More
generally, many exchangeable distributions on networks [9], including random dot product graphs [14] and 
some geometric random graphs [13] can be expressed as graphons. 

Given an observed adjacency matrix $A'$ sampled according to the model (1), the graphon function $W_0$
is not identifiable. The topology of a network is invariant with respect to any change of labelling of its
nodes. Consequently, for any given function $W_0(\cdot,\cdot)$ and a measure-preserving bijection $\tau : [0,1] \to [0,1]$
(with respect to the Lebesgue measure), the functions $W_0(x,y)$ and $W_0^{\tau}(x,y) := W_0(\tau(x), \tau(y))$ define the
same probability distribution on random graphs. The equivalence classes of graphons defining the same
probability distribution are characterized in a slightly more involved way based on the following result.

Proposition 1.1 ([12, Sect. 10]). Two graphons $U$ and $W$ in $\mathcal{W}$ are called weakly isomorphic if there exist
measure-preserving maps $\phi, \psi : [0,1] \to [0,1]$ such that $U^{\phi} = W^{\psi}$ almost everywhere. Two graphons $U$ and
$W$ define the same probability distribution if and only if they are weakly isomorphic.

This proposition motivates considering the equivalence classes of graphons that are weakly isomorphic.
The corresponding quotient space is denoted by $\widetilde{\mathcal{W}}$.

It is easy to see that the expected number of edges in model (1) is a constant times the squared number of 
vertices, which corresponds to the dense case. Networks observed in the real life are often sparse in the sense 
that the number of edges is of the order $o(n^2)$ as the number $n$ of vertices tends to infinity. The situation
in the sparse case is more complex than in the dense case. Even the almost dense case with $n^{2-o(1)}$ edges
is rather different from the dense case. The extremely sparse case corresponds to the bounded degree graph 
where all degrees are smaller than a fixed positive integer. It is an opposite extreme to the dense graph. 
Networks that occur in the applications are usually between these two extremes described by dense and 
bounded degree graphs. They often correspond to inhomogeneous networks with density of edges tending to 
0 but with the maximum degree tending to infinity as n grows. 

For a given $\rho_n > 0$, one can modify the definition (1) to get a random graph model with $O(\rho_n n^2)$ edges.
It is usually assumed that $\rho_n \to 0$ as $n \to \infty$. The adjacency matrix $A'$ is sampled according to the graphon
$W_0 \in \mathcal{W}$ with scaling parameter $\rho_n$ if for all $j < i$,

$$(\Theta_0)_{ij} = \rho_n W_0(\xi_i, \xi_j) \tag{2}$$

where $\rho_n > 0$ is the scale parameter that can be interpreted as the expected proportion of non-zero edges.
Alternatively, model (2) can be considered as a graphon model (1) that has been sparsified in the sense that
its edges have been independently removed with probability $1 - \rho_n$ and kept with probability $\rho_n$. This sparse
graphon model was considered in [2, 3, 16, 17].
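The first stage of the sparse graphon model (2), drawing the latent design and forming the probability matrix, can be sketched as follows. The function name, its arguments and the example graphon are hypothetical choices made for illustration only.

```python
import random


def sample_sparse_graphon_probabilities(graphon, n, rho, seed=None):
    """Return (xi, theta) with theta[i][j] = rho * graphon(xi[i], xi[j]).

    graphon is any symmetric function from [0,1]^2 to [0,1] and rho is the
    scaling parameter in (0, 1]. All names here are assumptions.
    """
    rng = random.Random(seed)
    # Latent i.i.d. uniform variables; they are unobserved in the model.
    xi = [rng.random() for _ in range(n)]
    theta = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i):
            p = rho * graphon(xi[i], xi[j])
            theta[i][j] = p
            theta[j][i] = p  # symmetric matrix, zero diagonal
    return xi, theta


# Hypothetical smooth graphon W0(x, y) = (x + y) / 2, scaled by rho = 0.1.
latent, probabilities = sample_sparse_graphon_probabilities(
    lambda x, y: (x + y) / 2.0, n=5, rho=0.1, seed=0
)
```

Conditionally on the latent variables, the edges are then independent Bernoulli draws with these probabilities; an estimator only sees the resulting adjacency matrix, never `latent`.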


1.2 Our results 

This paper has two contributions. First, we study optimal rates of estimation of the probability matrix $\Theta_0$
under the Frobenius norm from an observation $A' = (A_{ij}, 1 \le j < i \le n)$ in the network sequence model.
We estimate $\Theta_0$ by a block constant matrix and we focus on deriving oracle inequalities with optimal rates
(with possibly non-polynomial complexity of estimation methods). Note that estimating $\Theta_0$ by a $k \times k$ block
constant matrix is equivalent to fitting a stochastic block model with $k$ classes. Estimation of $\Theta_0$ has already
been considered by [6, 17, 5] but convergence rates obtained there are far from being optimal. More recently,
Gao et al. [10] have established the minimax estimation rates for $\Theta_0$ on classes of block constant matrices
and on the smooth graphon classes. Their analysis is restricted to the dense case (1) corresponding to $\rho_n = 1$
when dealing with model (2). In this paper, we explore the general setting of model (2). In particular, our
aim is to understand the behavior of least squares estimators when the probabilities $(\Theta_0)_{ij}$ can be arbitrarily
small. This will be done via developing oracle inequalities with respect to the block constant oracle. Two
estimators will be considered: the ordinary block constant least squares estimator, and its restricted version
where the estimator is chosen in the $\ell_\infty$ cube of a given radius. As corollaries, we provide an extension
for the sparse graphon model of some minimax results in [10] and we quantify the impact of the scaling
parameter $\rho_n$ on the optimal rates of convergence.

Second, we consider estimation of the graphon function $W_0$ based on the observation $A'$. In view of
Proposition 1.1, graphons are not identifiable and can only be estimated up to weak isomorphisms. Hence, we
study estimation of $W_0$ in the quotient space $\widetilde{\mathcal{W}}$ of graphons. In order to contrast this problem with the
estimation of $\Theta_0$, one can invoke an analogy with the random design nonparametric regression. Suppose
that we observe $(y_i, \xi_i)$, $i = 1, \dots, n$, that are independently sampled according to the model $y = f(\xi) + \epsilon$
where $f$ is an unknown regression function, $\epsilon$ is a zero mean random variable and $\xi$ is distributed with
some density $h$ on $[0,1]$. Given a sample of $(y_i, \xi_i)$, estimation of $f$ with respect to the empirical loss is
equivalent to estimation of the vector $(f(\xi_1), \dots, f(\xi_n))$ in, for instance, the Euclidean norm. On the other
hand, estimation under the integrated loss consists in constructing an estimator $\hat f$ such that the integral
$\int (\hat f(t) - f(t))^2 h(t)\,dt$ is small. Following this analogy, estimation of $\Theta_0$ corresponds to an empirical loss
problem whereas the graphon estimation corresponds to an integrated loss problem. However, as opposed
to nonparametric regression, in the graphon models (1) and (2) the design $\xi_1, \dots, \xi_n$ is not observed, which
makes it quite challenging to derive the convergence rates in these settings.

In Section 3, we obtain $L_2$ norm non-asymptotic estimation rates for graphon estimation on classes of step
functions (analogs of stochastic block models) and on classes of smooth graphons in model (2). This result 
improves upon previously known bounds by Wolfe and Olhede [16]. For classes of step function graphons, we 
also provide a matching minimax lower bound allowing one to characterize the regimes such that graphon 
optimal estimation rates are slower than probability matrix estimation rates. In a work parallel to ours, 
Borgs et al. [4] have analyzed the rates of convergence for estimation of step function graphons under the 
privacy model. Apart from the issues of privacy that we do not consider here, their results in our setting 
provide a weaker version of the upper bound in Corollary 3.3, with a suboptimal rate, which is the square 
root of the rate of Corollary 3.3 in the moderately sparse zone. We also mention the paper by Choi [7] 
devoted to the convergence of empirical risk functions associated to the graphon model. 

1.3 Notation 

We provide a brief summary of the notation used throughout this paper. 

• For a matrix $B$, we denote by $B_{ij}$ (or by $B_{i,j}$, or by $(B)_{ij}$) its $(i,j)$th entry.

• For an integer $m$, set $[m] = \{1, \dots, m\}$.

• Let $n$, $k$ and $n_0$ be integers such that $2 \le k \le n$, $n_0 \le n$. We denote by $\mathcal{Z}_{n,k,n_0}$ the set of all mappings
$z$ from $[n]$ to $[k]$ such that $\min_{a=1,\dots,k} |z^{-1}(a)| \ge n_0$ (the minimal size of each “community” is $n_0$). For
brevity, we set $\mathcal{Z}_{n,k} = \mathcal{Z}_{n,k,1}$.

• We denote by $\mathbb{R}^{k \times k}_{\mathrm{sym}}$ the class of all symmetric $k \times k$ matrices with real-valued entries.

• The inner product between matrices $D, B \in \mathbb{R}^{n \times n}$ will be denoted by $\langle D, B \rangle = \sum_{i,j} D_{ij} B_{ij}$.

• Denote by $\|B\|_F$ and by $\|B\|_\infty$ the Frobenius norm and the entry-wise supremum norm of a matrix
$B \in \mathbb{R}^{n \times n}$ respectively.

• We denote by $\lfloor x \rfloor$ the maximal integer less than $x > 0$ and by $\lceil x \rceil$ the smallest integer greater than or
equal to $x$.

• We denote by $\mathbb{E}$ the expectation with respect to the distribution of $A$ if we consider the network
sequence model and the expectation with respect to the joint distribution of $(\xi, A)$ if we consider the
graphon model.

• We denote by $C$ positive constants that can vary from line to line. These are absolute constants unless
otherwise mentioned.

• We denote by $\lambda$ the Lebesgue measure on the interval $[0,1]$.

2 Probability matrix estimation 

In this section, we deal with the network sequence model and we obtain optimal rates of estimation in the
Frobenius norm of the probability matrix $\Theta_0$. Fixing some integer $k > 0$, we estimate the matrix $\Theta_0$ by a
block constant matrix with $k \times k$ blocks.

2.1 Least squares estimator 

First, we study the least squares estimator $\widehat{\Theta}$ of $\Theta_0$ in the collection of all $k \times k$ block constant matrices
with block size larger than some given integer $n_0$. Recall that we denote by $\mathcal{Z}_{n,k,n_0}$ the set of all possible
mappings $z$ from $[n]$ to $[k]$ such that $\min_{a=1,\dots,k} |z^{-1}(a)| \ge n_0$. For any $z \in \mathcal{Z}_{n,k,n_0}$ and $Q \in \mathbb{R}^{k \times k}_{\mathrm{sym}}$, we define
the residual sum of squares

$$L(Q,z) = \sum_{(a,b) \in [k] \times [k]} \ \sum_{(i,j) \in z^{-1}(a) \times z^{-1}(b),\ j < i} (A_{ij} - Q_{ab})^2$$

and consider the least squares estimator of $(Q, z)$:

$$(\widehat{Q}, \hat{z}) \in \arg\min_{Q \in \mathbb{R}^{k \times k}_{\mathrm{sym}},\ z \in \mathcal{Z}_{n,k,n_0}} L(Q,z).$$

The block constant least squares estimator of $(\Theta_0)_{ij}$ is defined as $\widehat{\Theta}_{ij} = \widehat{Q}_{\hat{z}(i)\hat{z}(j)}$ for all $i > j$. Note
that $\widehat{\Theta}_{ij} \in [0,1]$. Finally, we denote by $\widehat{\Theta}$ the symmetric matrix with entries $\widehat{\Theta}_{ij}$ for all $i > j$ and with
zero diagonal entries. According to the stochastic block models terminology, $\widehat{Q}$ stands for the estimated
connection probabilities whereas $\hat{z}$ is the estimated partition of the set of nodes into $k$ communities. The only
difference between this estimator and the one considered in [10] is the restriction $|z^{-1}(a)| \ge n_0$. This prevents
the partition $\hat{z}$ from being too unbalanced. A common choice of $n_0$ is of the order $n/k$, which is referred
to as a balanced partition. Taking larger $n_0$ when we have additional information allows us to obtain simpler
estimators since we reduce the number of configurations.
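To make the definition concrete, here is a brute-force sketch of the block constant least squares estimator. For a fixed labelling $z$, the residual sum of squares is minimized by the block averages of the $A_{ij}$, so only the search over $z$ is combinatorial. The exhaustive search over all $k^n$ labellings is an assumption of this sketch, usable for tiny $n$ only, and all function names are illustrative.

```python
from itertools import product


def block_means(adj, z, k):
    """Least squares Q for a fixed labelling z: block averages of A_ij."""
    n = len(adj)
    sums = [[0.0] * k for _ in range(k)]
    counts = [[0] * k for _ in range(k)]
    for i in range(n):
        for j in range(i):
            a, b = sorted((z[i], z[j]))  # Q is symmetric: pool (a,b) and (b,a)
            sums[a][b] += adj[i][j]
            counts[a][b] += 1
    q = [[0.0] * k for _ in range(k)]
    for a in range(k):
        for b in range(a, k):
            if counts[a][b]:  # blocks without any pair keep the value 0
                q[a][b] = q[b][a] = sums[a][b] / counts[a][b]
    return q


def residual_sum_of_squares(adj, q, z):
    """L(Q, z): squared residuals summed over pairs j < i."""
    n = len(adj)
    return sum((adj[i][j] - q[z[i]][z[j]]) ** 2
               for i in range(n) for j in range(i))


def least_squares_estimator(adj, k, n0=1):
    """Exhaustive search over labellings with all communities of size >= n0."""
    n = len(adj)
    best = None
    for z in product(range(k), repeat=n):
        if min(z.count(a) for a in range(k)) < n0:
            continue  # enforce the minimal community size
        q = block_means(adj, z, k)
        loss = residual_sum_of_squares(adj, q, z)
        if best is None or loss < best[0]:
            best = (loss, q, z)
    loss, q, z = best
    theta = [[0.0 if i == j else q[z[i]][z[j]] for j in range(n)]
             for i in range(n)]
    return theta, q, z, loss
```

The search cost grows as $k^n$, which is consistent with the remark that the estimation methods studied here may have non-polynomial complexity.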

2.2 Restricted least squares estimator 

Given $r \in (0,1]$, consider the least squares estimator restricted to the $\ell_\infty$ ball of radius $r$,

$$(\widehat{Q}^r, \hat{z}^r) \in \arg\min_{Q \in \mathbb{R}^{k \times k}_{\mathrm{sym}}:\ \|Q\|_\infty \le r,\ z \in \mathcal{Z}_{n,k}} L(Q,z).$$

The estimator $\widehat{\Theta}^r$ of the matrix $\Theta_0$ is defined as the symmetric matrix with entries $\widehat{\Theta}^r_{ij} = (\widehat{Q}^r)_{\hat{z}^r(i)\hat{z}^r(j)}$ for all
$i > j$, and with zero diagonal entries. Note that here we consider any partitions, including really unbalanced
ones ($n_0 = 1$).
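For a fixed labelling $z$, the restricted problem separates into one-dimensional convex quadratics, one per unordered pair of labels, so the constrained minimizer is the block average projected onto $[-r, r]$. The sketch below implements only this inner step; the outer minimization over $z$ is left out, and the function name is an assumption of the example.

```python
def restricted_block_means(adj, z, k, r):
    """Minimize L(Q, z) over symmetric Q with max |Q_ab| <= r, for fixed z.

    Each entry is the block average of A_ij clipped to [-r, r]; since the
    averages lie in [0, 1], this amounts to capping them at r.
    """
    n = len(adj)
    sums = [[0.0] * k for _ in range(k)]
    counts = [[0] * k for _ in range(k)]
    for i in range(n):
        for j in range(i):
            a, b = sorted((z[i], z[j]))  # symmetric Q: pool (a,b) and (b,a)
            sums[a][b] += adj[i][j]
            counts[a][b] += 1
    q = [[0.0] * k for _ in range(k)]
    for a in range(k):
        for b in range(a, k):
            if counts[a][b]:
                mean = sums[a][b] / counts[a][b]
                # Projection of the unconstrained minimizer onto [-r, r].
                q[a][b] = q[b][a] = max(-r, min(r, mean))
    return q
```

With $r = 1$ the clipping is inactive and the function returns the ordinary block averages.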

2.3 Oracle inequalities 

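Let $\Theta_{*,n_0}$ be the best Frobenius norm approximation of $\Theta_0$ in the collection of matrices

$$\mathcal{T}_{n_0}[k] = \{\Theta : \exists z \in \mathcal{Z}_{n,k,n_0},\ Q \in \mathbb{R}^{k \times k}_{\mathrm{sym}} \text{ such that } \Theta_{ij} = Q_{z(i)z(j)},\ i \ne j, \text{ and } \Theta_{ii} = 0\ \forall i\}.$$

In particular, $\mathcal{T}_1[k]$ is the set of all probability matrices corresponding to $k$-class stochastic block models
without group size restriction. For brevity, we write $\mathcal{T}_1[k] = \mathcal{T}[k]$ and $\Theta_{*,1} = \Theta_*$.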
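Proposition 2.1. Consider the network sequence model. There exist positive absolute constants $C_1$, $C_2$, $C_3$
such that for $n_0 \ge 2$,

$$\mathbb{E}\left[\frac{1}{n^2}\|\widehat{\Theta} - \Theta_0\|_F^2\right] \le \frac{C_1}{n^2}\|\Theta_0 - \Theta_{*,n_0}\|_F^2 + C_2 \|\Theta_0\|_\infty \left(\frac{\log(k)}{n} + \frac{k^2}{n^2}\right) + \frac{C_3 \log(n/n_0)}{n_0}\left(\frac{\log(k)}{n} + \frac{k^2}{n^2}\right). \tag{3}$$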

To discuss some consequences of this oracle inequality, we consider the case of balanced partitions. 


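Corollary 2.2. Consider the network sequence model. Let $n_0 \ge Cn/k$ for some $C > 0$ (balanced partition),
$n_0 \ge 2$, and $\|\Theta_0\|_\infty \ge k \log(k)/n$. Then, there exist positive absolute constants $C_1$, $C_2$ such that

$$\mathbb{E}\left[\frac{1}{n^2}\|\widehat{\Theta} - \Theta_0\|_F^2\right] \le \frac{C_1}{n^2}\|\Theta_0 - \Theta_{*,n_0}\|_F^2 + C_2 \|\Theta_0\|_\infty \left(\frac{\log(k)}{n} + \frac{k^2}{n^2}\right). \tag{4}$$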


In particular, if $\Theta_0 \in \mathcal{T}_{n_0}[k]$ (i.e., when we have a $k$-class stochastic block model with balanced
communities), the rate of convergence is of the order $\|\Theta_0\|_\infty \left(\log(k)/n + k^2/n^2\right)$. In [10], the rate
$\log(k)/n + k^2/n^2$ is shown to be minimax over all stochastic block models with $k$ blocks, with no restriction
on $\|\Theta_0\|_\infty$ except the obvious one, $\|\Theta_0\|_\infty \le 1$. We prove in the next subsection that the rate
$\|\Theta_0\|_\infty \left(\log(k)/n + k^2/n^2\right)$ obtained in Corollary 2.2 exhibits the optimal dependency on $\|\Theta_0\|_\infty$.
Before doing this, we provide an oracle inequality for the restricted least squares estimator $\widehat{\Theta}^r$.


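Proposition 2.3. Consider the network sequence model. There exist positive absolute constants $C_1$ and $C_2$
such that the following holds. If $\|\Theta_0\|_\infty \le r$, then

$$\mathbb{E}\left[\frac{1}{n^2}\|\widehat{\Theta}^r - \Theta_0\|_F^2\right] \le \frac{C_1}{n^2}\|\Theta_0 - \Theta_*\|_F^2 + C_2\, r \left(\frac{\log(k)}{n} + \frac{k^2}{n^2}\right). \tag{5}$$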


As opposed to Corollary 2.2, the risk bound (5) is applicable for unbalanced partitions and for arbitrarily
small values of $\|\Theta_0\|_\infty$. However, this restricted least squares estimator requires the knowledge of an upper
bound on $\|\Theta_0\|_\infty$. Whereas the mean value of $(\Theta_0)_{ij}$ is easily inferred from the data, the maximal value
$\|\Theta_0\|_\infty$ is difficult to estimate. If the matrix $\Theta_0$ satisfies the sparse graphon model (2), one can set $r = r_n =
u_n \bar{A}$ where $\bar{A} = 2\sum_{j<i} A_{ij} / (n(n-1))$ is the edge density of the graph and $u_n$ is any sequence that tends to
infinity (for example, $u_n = \log\log n$). For $n$ large enough, $\bar{A}/\rho_n$ is close to $\int_{[0,1]^2} W_0(x,y)\,dx\,dy$ in the sparse
graphon model (2) with probability close to one, and therefore $r_n$ is greater than $\|\Theta_0\|_\infty$. The price to pay
for this data-driven choice of $r_n$ is that the rate in the risk bound (5) is multiplied by $u_n$.
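The data-driven radius described above can be computed in a few lines. This is a minimal sketch: the function names are hypothetical, and capping the result at 1 is an added assumption that keeps $r_n$ inside the range $(0,1]$ used in the definition of the restricted estimator.

```python
import math


def edge_density(adj):
    """Edge density: 2 * (number of edges) / (n * (n - 1))."""
    n = len(adj)
    edges = sum(adj[i][j] for i in range(n) for j in range(i))
    return 2.0 * edges / (n * (n - 1))


def data_driven_radius(adj):
    """r_n = u_n * edge density with u_n = log log n, capped at 1.

    The cap at 1 is an assumption of this sketch. Note that log log n
    exceeds 1 only for n > e^e (about 15.2), so the choice is meaningful
    for moderately large graphs only.
    """
    n = len(adj)
    u_n = math.log(math.log(n))
    return min(1.0, u_n * edge_density(adj))
```

Any other sequence $u_n$ tending to infinity could be substituted; a slower sequence inflates the rate in (5) less but needs a larger $n$ before $r_n$ exceeds $\|\Theta_0\|_\infty$.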


2.4 Stochastic block models 


Given an integer $k$ and any $\rho_n \in (0,1]$, consider the set of all probability matrices corresponding to a $k$-class
stochastic block model with connection probability uniformly smaller than $\rho_n$:

$$\mathcal{T}[k, \rho_n] = \{\Theta_0 \in \mathcal{T}[k] : \|\Theta_0\|_\infty \le \rho_n\}. \tag{6}$$

Gao et al. [10] have proved that the minimax estimation rate over $\mathcal{T}[k, 1]$ is of the order $k^2/n^2 + \log(k)/n$. The
following proposition extends their lower bound to arbitrarily small $\rho_n > 0$.

Proposition 2.4. Consider the network sequence model. For all $k \le n$ and all $0 < \rho_n \le 1$,

$$\inf_{\widehat{T}}\ \sup_{\Theta_0 \in \mathcal{T}[k,\rho_n]} \mathbb{E}_{\Theta_0}\left[\frac{1}{n^2}\|\widehat{T} - \Theta_0\|_F^2\right] \ge C \min\left(\rho_n\left(\frac{\log(k)}{n} + \frac{k^2}{n^2}\right),\ \rho_n^2\right) \tag{7}$$

where $\mathbb{E}_{\Theta_0}$ denotes the expectation with respect to the distribution of $A$ when the underlying probability matrix
is $\Theta_0$ and $\inf_{\widehat{T}}$ is the infimum over all estimators.

If $\rho_n \ge \log(k)/n + k^2/n^2$, the minimax rate of estimation is of the order $\rho_n\left(\log(k)/n + k^2/n^2\right)$. This rate is achieved
by the restricted least squares estimator with $r \asymp \rho_n$ and by the least squares estimator if the partition is
balanced and $\rho_n \ge k\log(k)/n$. For really sparse graphs ($\rho_n$ smaller than $\log(k)/n + k^2/n^2$), the estimation problem
becomes rather trivial since both the null estimator $\widehat{T} = 0$ and the constant least squares estimator $\widehat{\Theta}$ with
all entries $\widehat{\Theta}_{ij} = \bar{A}$ achieve the optimal rate $\rho_n^2$.
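The rate appearing in the lower bound of Proposition 2.4 is easy to evaluate numerically, which helps to see where the sparse regime starts. The helper below is a sketch with a hypothetical name; it drops the absolute constant $C$.

```python
import math


def sbm_minimax_rate(rho, n, k):
    """min( rho * (log(k)/n + k^2/n^2), rho^2 ), up to an absolute constant."""
    return min(rho * (math.log(k) / n + (k / n) ** 2), rho ** 2)


# Dense case (rho = 1): the familiar rate log(k)/n + k^2/n^2.
dense = sbm_minimax_rate(1.0, n=100, k=10)

# Very sparse case: rho^2 is smaller, so the null estimator is rate optimal.
sparse = sbm_minimax_rate(0.001, n=100, k=10)
```

In the second call $\rho_n$ is below $\log(k)/n + k^2/n^2$, so the minimum is attained by $\rho_n^2$.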
2.5 Smooth graphons

We now use Propositions 2.1 and 2.3 to obtain rates of convergence for the probability matrix estimation
when $\Theta_0$ satisfies the sparse graphon model (2) and the graphon $W_0 \in \mathcal{W}$ is smooth. For any $\alpha > 0$, $L > 0$,
define the class of $\alpha$-Hölder continuous functions $\Sigma(\alpha, L)$ as the set of all functions $W : [0,1]^2 \to [0,1]$ such
that

$$|W(x',y') - P_{\lfloor\alpha\rfloor}((x,y), (x'-x, y'-y))| \le L\left(|x'-x|^{\alpha - \lfloor\alpha\rfloor} + |y'-y|^{\alpha - \lfloor\alpha\rfloor}\right)$$

for all $(x',y'), (x,y) \in [0,1]^2$, where $\lfloor\alpha\rfloor$ is the maximal integer less than $\alpha$, and the function $(x',y') \mapsto
P_{\lfloor\alpha\rfloor}((x,y), (x'-x, y'-y))$ is the Taylor polynomial of degree $\lfloor\alpha\rfloor$ at point $(x,y)$. It follows from the
standard embedding theorems that, for any $W \in \Sigma(\alpha, L)$ and any $(x',y'), (x,y) \in [0,1]^2$,

$$|W(x',y') - W(x,y)| \le M\left(|x'-x|^{\alpha\wedge 1} + |y'-y|^{\alpha\wedge 1}\right) \tag{8}$$

where $M$ is a constant depending only on $\alpha$ and $L$. In the following, we will use only this last property of
$W \in \Sigma(\alpha, L)$.

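The following two propositions give bounds on the bias terms $\|\Theta_0 - \Theta_{*,n_0}\|_F^2$ and $\|\Theta_0 - \Theta_*\|_F^2$.

Proposition 2.5. Consider the graphon model (2) with $W_0 \in \Sigma(\alpha, L)$ where $\alpha, L > 0$. Let $n_0 \ge 2$ and
$k = \lfloor n/n_0 \rfloor$. Then,

$$\mathbb{E}\left(\frac{1}{n^2}\|\Theta_0 - \Theta_{*,n_0}\|_F^2\right) \le C M^2 \rho_n^2 \left(\frac{1}{k^2}\right)^{\alpha\wedge 1}. \tag{9}$$

We will also need the following proposition proved in [10], Lemma 2.1.

Proposition 2.6. Consider the graphon model (2) with $W_0 \in \Sigma(\alpha, L)$ where $\alpha, L > 0$. Then, almost surely,

$$\frac{1}{n^2}\|\Theta_0 - \Theta_*\|_F^2 \le C M^2 \rho_n^2 \left(\frac{1}{k^2}\right)^{\alpha\wedge 1}. \tag{10}$$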

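Corollary 2.7. Consider the graphon model (2) with $W_0 \in \Sigma(\alpha, L)$ where $\alpha, L > 0$ and $0 < \rho_n \le 1$. Set
$k = \left\lceil \left(\rho_n^{1/2} n\right)^{\frac{1}{1+\alpha\wedge 1}} \right\rceil$.

(i) Assume that $\rho_n \ge n^{-\frac{2(\alpha\wedge 1)}{1+2(\alpha\wedge 1)}} (\log n)^{2(1+\alpha\wedge 1)}$ and there exists $n_0 \ge 2$ such that $k = \lfloor n/n_0 \rfloor$. Then
there exists a positive constant $C$ depending only on $L$ and $\alpha$ such that the least squares estimator $\widehat{\Theta}$
constructed with this choice of $n_0$ satisfies

$$\mathbb{E}\left[\frac{1}{n^2}\|\widehat{\Theta} - \Theta_0\|_F^2\right] \le C\left(\rho_n^{\frac{2+\alpha\wedge 1}{1+\alpha\wedge 1}}\, n^{-\frac{2(\alpha\wedge 1)}{1+\alpha\wedge 1}} + \frac{\rho_n \log n}{n}\right). \tag{11}$$

(ii) Assume that $r \ge \rho_n \ge C n^{-2}$. Then there exists a positive constant $C$ depending only on $L$ and $\alpha$ such
that the restricted least squares estimator $\widehat{\Theta}^r$ satisfies

$$\mathbb{E}\left[\frac{1}{n^2}\|\widehat{\Theta}^r - \Theta_0\|_F^2\right] \le C\left(r^{\frac{2+\alpha\wedge 1}{1+\alpha\wedge 1}}\, n^{-\frac{2(\alpha\wedge 1)}{1+\alpha\wedge 1}} + \frac{r \log n}{n}\right). \tag{12}$$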


Corollary 2.7 extends the results obtained in Gao et al. [10] to arbitrary $\rho_n \in (0,1]$. To simplify the
discussion assume that $\alpha \le 1$. There are two ingredients in the rates of convergence in Corollary 2.7: the
nonparametric rate $\rho_n^{\frac{2+\alpha}{1+\alpha}} n^{-\frac{2\alpha}{1+\alpha}}$ and the clustering rate $\rho_n \log(n)/n$. The smoothness index $\alpha$ has an impact on
the rate only if $\alpha \in (0,1)$ and only if the network is not too sparse, that is if $\rho_n \ge C n^{\alpha-1} (\log n)^{1+\alpha}$.

In [10], Gao et al. prove a lower bound showing that the rate $n^{-\frac{2(\alpha\wedge 1)}{1+\alpha\wedge 1}} + \log(n)/n$ is optimal if $\rho_n = 1$.
Following the same lines as in Proposition 2.4, one can readily extend their result to prove that the rate in
(11) is minimax optimal for $\rho_n \ge n^{-2+\epsilon}$ with an arbitrarily small $\epsilon > 0$.
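The number of blocks used for smooth graphons and the two ingredients of the resulting rate can be evaluated with the sketch below. The function names are hypothetical, constants are dropped, and the formula for $k$ assumes the choice $k = \lceil (\rho_n^{1/2} n)^{1/(1+\alpha\wedge 1)} \rceil$.

```python
import math


def smooth_graphon_blocks(rho, n, alpha):
    """k = ceil( (sqrt(rho) * n) ** (1 / (1 + min(alpha, 1))) )."""
    a = min(alpha, 1.0)
    return math.ceil((math.sqrt(rho) * n) ** (1.0 / (1.0 + a)))


def smooth_graphon_rate(rho, n, alpha):
    """Sum of the nonparametric and clustering rates, up to constants."""
    a = min(alpha, 1.0)
    nonparametric = rho ** ((2.0 + a) / (1.0 + a)) * n ** (-2.0 * a / (1.0 + a))
    clustering = rho * math.log(n) / n
    return nonparametric + clustering
```

For $\alpha \ge 1$ the exponent saturates at $\alpha\wedge 1 = 1$, so additional smoothness beyond Lipschitz does not change either function.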
3 Graphon estimation problem

In this section, our purpose is to estimate the graphon function $W_0(\cdot,\cdot)$ in the sparse graphon model (2).


3.1 From probability matrix estimation to graphon estimation 

We start by associating a graphon to any $n \times n$ probability matrix $\Theta$. This provides a way of deriving an
estimator of $f_0(\cdot,\cdot) = \rho_n W_0(\cdot,\cdot)$ from an estimator of $\Theta_0$.

Given an $n \times n$ matrix $\Theta$ with entries in $[0,1]$, define the empirical graphon $\tilde{f}_\Theta$ as the following piecewise
constant function:

$$\tilde{f}_\Theta(x,y) = \Theta_{\lceil nx \rceil, \lceil ny \rceil} \tag{13}$$

for all $x$ and $y$ in $(0,1]$. The empirical graphons associated to the least squares estimator and to the restricted
least squares estimator with threshold $r$ will be denoted by $\hat{f}$ and $\hat{f}_r$ respectively:

$$\hat{f} = \tilde{f}_{\widehat{\Theta}}, \qquad \hat{f}_r = \tilde{f}_{\widehat{\Theta}^r}.$$

They will be used as graphon estimators. For any estimator $\hat{f}$ of $f_0 = \rho_n W_0$, define the squared error

$$\delta^2(\hat{f}, f_0) := \inf_{\tau \in \mathcal{M}} \iint_{(0,1)^2} |f_0(\tau(x), \tau(y)) - \hat{f}(x,y)|^2\,dx\,dy \tag{14}$$

where $\mathcal{M}$ is the set of all measure-preserving bijections $\tau : [0,1] \to [0,1]$. It has been proved in [12, Ch. 8, 13]
that $\delta(\cdot,\cdot)$ defines a metric on the quotient space $\widetilde{\mathcal{W}}$ of graphons.
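The empirical graphon (13) is simply a lookup into the matrix, as the following sketch shows. The function name is an assumption; indices are shifted by one because Python lists are zero-based.

```python
import math


def empirical_graphon(theta):
    """Return the piecewise constant function (x, y) -> theta[ceil(nx)][ceil(ny)].

    The returned function is defined for x and y in (0, 1]. The one-based
    indices of the definition are shifted to zero-based list indices.
    """
    n = len(theta)

    def f(x, y):
        if not (0 < x <= 1 and 0 < y <= 1):
            raise ValueError("x and y must lie in (0, 1]")
        i = math.ceil(n * x)
        j = math.ceil(n * y)
        return theta[i - 1][j - 1]

    return f
```

Evaluating the squared error (14) is a different matter: the infimum over measure-preserving bijections has no simple closed form, so this sketch covers the construction of the estimator only.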

The following lemma is a simple consequence of the triangle inequality. 

Lemma 3.1. Consider the graphon model (2). For any $W_0 \in \mathcal{W}$, $\rho_n > 0$, and any estimator $\widehat{T}$ of $\Theta_0$ such
that $\widehat{T}$ is an $n \times n$ matrix with entries in $[0,1]$, we have

$$\mathbb{E}\left[\delta^2(\tilde{f}_{\widehat{T}}, f_0)\right] \le 2\,\mathbb{E}\left[\frac{1}{n^2}\|\widehat{T} - \Theta_0\|_F^2\right] + 2\,\mathbb{E}\left[\delta^2(\tilde{f}_{\Theta_0}, f_0)\right]. \tag{15}$$

The bound on the integrated risk in (15) is a sum of two terms. The first term containing $\|\widehat{T} - \Theta_0\|_F$
has been considered in Section 2 for $\widehat{T} = \widehat{\Theta}$ and $\widehat{T} = \widehat{\Theta}^r$. It is the estimation error term. The second
term containing $\delta^2(\tilde{f}_{\Theta_0}, f_0)$ measures the distance between the true graphon $f_0$ and its discretized version
sampled at the unobserved random design points $\xi_1, \dots, \xi_n$. We call it the agnostic error. The behavior of
$\delta^2(\tilde{f}_{\Theta_0}, f_0)$ depends on the topology of the considered graphons as shown below for two examples, namely,
the step function graphons and the smooth graphons.


3.2 Step function graphons 

Define $\mathcal{W}[k]$ as the collection of $k$-step graphons, that is, the subset of graphons $W \in \mathcal{W}$ such that for some
$Q \in \mathbb{R}^{k \times k}_{\mathrm{sym}}$ and some $\phi : [0,1] \to [k]$,

$$W(x,y) = Q_{\phi(x),\phi(y)} \quad \text{for all } x, y \in [0,1]. \tag{16}$$

A step function $W \in \mathcal{W}[k]$ is called balanced if $\lambda(\phi^{-1}(1)) = \dots = \lambda(\phi^{-1}(k)) = 1/k$ where $\lambda$ is the Lebesgue
measure on $[0,1]$. The agnostic error associated to step function graphons is evaluated as follows.

Proposition 3.2. Consider the graphon model (2). For all integers $k \le n$, $W_0 \in \mathcal{W}[k]$ and $\rho_n > 0$ we have

$$\mathbb{E}\left[\delta^2(\tilde{f}_{\Theta_0}, f_0)\right] \le C \rho_n^2 \sqrt{\frac{k}{n}}.$$

Combining this result with Lemma 3.1 and Propositions 2.1 and 2.3 we obtain the following risk bounds 
for the least squares and restricted least squares graphon estimators. 
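Corollary 3.3. Consider the graphon model (2) with $W_0 \in \mathcal{W}[k]$. There exist absolute constants $C_1$ and
$C_2$ such that the following holds.

(i) If $\rho_n \ge k\log(k)/n$, $n_0 = \lfloor n/2k \rfloor$, and the step function $W_0$ is balanced, then the least squares graphon
estimator $\hat{f}$ constructed with this choice of $n_0$ satisfies

$$\mathbb{E}\left[\delta^2(\hat{f}, f_0)\right] \le C_1\left[\rho_n\left(\frac{k^2}{n^2} + \frac{\log(k)}{n}\right) + \rho_n^2\sqrt{\frac{k}{n}}\right].$$

(ii) If $\rho_n \le r$, the restricted least squares graphon estimator $\hat{f}_r$ satisfies

$$\mathbb{E}\left[\delta^2(\hat{f}_r, f_0)\right] \le C_2\left[r\left(\frac{k^2}{n^2} + \frac{\log(k)}{n}\right) + \rho_n^2\sqrt{\frac{k}{n}}\right]. \tag{17}$$

As an immediate consequence of this corollary, we get the following upper bound on the minimax risk:

$$\inf_{\hat{f}}\ \sup_{W_0 \in \mathcal{W}[k]} \mathbb{E}_{W_0}\left[\delta^2(\hat{f}, f_0)\right] \le C_3 \min\left\{\rho_n\left(\frac{k^2}{n^2} + \frac{\log(k)}{n}\right) + \rho_n^2\sqrt{\frac{k}{n}},\ \rho_n^2\right\} \tag{18}$$

where $C_3$ is an absolute constant. Here, $\mathbb{E}_{W_0}$ denotes the expectation with respect to the distribution of
observations $A' = (A_{ij}, 1 \le j < i \le n)$ when the underlying sparse graphon is $\rho_n W_0$ and $\inf_{\hat{f}}$ is the infimum
over all estimators. The bound (18) follows from (17) with $r = \rho_n$ and the fact that the risk of the null
estimator $\tilde{f} \equiv 0$ is smaller than $\rho_n^2$.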

The next proposition shows that the upper bound (18) is optimal in a minimax sense (up to a logarithmic 
factor in k in one of the regimes). 

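Proposition 3.4. There exists a universal constant $C > 0$ such that for any sequence $\rho_n > 0$ and any
positive integer $2 \le k \le n$,

$$\inf_{\hat{f}}\ \sup_{W_0 \in \mathcal{W}[k]} \mathbb{E}_{W_0}\left[\delta^2(\hat{f}, f_0)\right] \ge C \min\left\{\rho_n \frac{k^2}{n^2} + \rho_n^2\sqrt{\frac{k-1}{n}},\ \rho_n^2\right\} \tag{19}$$

where $\mathbb{E}_{W_0}$ denotes the expectation with respect to the distribution of observations $A' = (A_{ij}, 1 \le j < i \le n)$
when the underlying sparse graphon is $\rho_n W_0$ and $\inf_{\hat{f}}$ is the infimum over all estimators.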

The proof is given in Section 4.9. Showing the rate $\rho_n^2\sqrt{(k-1)/n}$ relies on two arguments. First, we
construct a family of graphons that are well separated with respect to the $\delta(\cdot,\cdot)$ distance. The difficulty
comes from the fact that this is not a classical $L_2$ distance but a minimum of the $L_2$ distance over all
measure-preserving transformations. Second, we show that the behavior of the Kullback-Leibler divergences
between the graphons in this family is driven by the randomness of the latent variables $\xi_1, \dots, \xi_n$ while the
nature of the connection probabilities can be neglected. This argument is similar to the “data processing”
inequality.

Proposition 3.4 does not cover the case $k = 1$ (Erdős-Rényi models). In fact, in this case the constant
graphon estimator $\hat{f} \equiv \bar{A}$ achieves the optimal rate, which is equal to $\rho_n/n^2$. The suboptimality of the
bound of Proposition 3.2 for $k = 1$ is due to the fact that the diagonal blocks of $\hat{f}$ and $\hat{f}_r$ are all equal
to zero. By carefully defining the diagonal blocks of these estimators, we could have achieved the optimal
bound $\rho_n^2\sqrt{(k-1)/n}$ but this would need a modification of the notation and of the proofs.

The bounds (18) and (19) imply that there are three regimes depending on the sparsity parameter $\rho_n$:

(i) Weakly sparse graphs: $\rho_n \ge \frac{\log(k)}{\sqrt{kn}} \vee \left(\frac{k}{n}\right)^{3/2}$. The minimax risk is of the order $\rho_n^2\sqrt{k/n}$, and thus it is
driven by the agnostic error arising from the lack of knowledge of the design.

(ii) Moderately sparse graphs: $\frac{\log(k)}{n} \vee \left(\frac{k}{n}\right)^{2} \le \rho_n \le \frac{\log(k)}{\sqrt{kn}} \vee \left(\frac{k}{n}\right)^{3/2}$. The risk bound (18) is driven by the
probability matrix estimation error. The upper bound (18) is of the order $\rho_n\left(\frac{k^2}{n^2} + \frac{\log(k)}{n}\right)$, which is
the optimal rate of probability matrix estimation, cf. Proposition 2.4. Due to (19), it is optimal up to a
$\log(k)$ factor with respect to the $\delta(\cdot,\cdot)$ distance.

(iii) Highly sparse graphs: $\rho_n \le \frac{\log(k)}{n} \vee \left(\frac{k}{n}\right)^{2}$. The minimax risk is of the order $\rho_n^2$, and it is attained by
the null estimator.
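The three regimes can be told apart by comparing $\rho_n$ with two thresholds, which follow from comparing the terms of the upper bound (18). The helper below is a sketch with hypothetical names; it ignores absolute constants, so it indicates the regime only up to constant factors.

```python
import math


def step_graphon_regime(rho, n, k):
    """Classify the sparsity regime for k-step graphons (2 <= k <= n)."""
    # Above this level the agnostic term rho^2 * sqrt(k/n) dominates.
    upper = max(math.log(k) / math.sqrt(k * n), (k / n) ** 1.5)
    # Below this level rho^2 is the smallest term and the null estimator wins.
    lower = max(math.log(k) / n, (k / n) ** 2)
    if rho >= upper:
        return "weakly sparse"
    if rho >= lower:
        return "moderately sparse"
    return "highly sparse"
```

Since $k \le n$, the lower threshold never exceeds the upper one, so the three cases are exhaustive and ordered.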

In a work parallel to ours, Borgs et al. [4] provide an upper bound for the risk of step function graphon 
estimators in the context of privacy. If the partitions are balanced, Borgs et al. [4] obtain the bound on the 
agnostic error as in Proposition 3.2. When there are no privacy issues, comparing the upper bound of [4] with
that of Corollary 3.3, we see that it has a suboptimal rate, which is the square root of the rate of Corollary 
3.3 in the moderately sparse zone. Note also that the setting in [4] is restricted to balanced partitions while 
we consider more general partitions. 


3.3 Smooth graphons 


We now derive bounds on the mean squared error of smooth graphon estimation. The analysis will be based 
on the results of Section 2 and on the following bound for the agnostic error associated to smooth graphons. 

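Proposition 3.5. Consider the graphon model (2) with $W_0 \in \Sigma(\alpha, L)$ where $\alpha, L > 0$ and $\rho_n > 0$. Then

$$\mathbb{E}\left[\delta^2(\tilde{f}_{\Theta_0}, f_0)\right] \le C \frac{\rho_n^2}{n^{\alpha\wedge 1}} \tag{20}$$

where the constant $C$ depends only on $L$ and $\alpha$.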


Combining Proposition 3.5 with Lemma 3.1 and with Propositions 2.1 and 2.3 we obtain the following 
risk bounds for the least squares and restricted least squares graphon estimators. 


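Corollary 3.6. Consider the graphon model (2) with $W_0 \in \Sigma(\alpha, L)$ where $\alpha, L > 0$ and $0 < \rho_n \le 1$. Fix
$k = \left\lceil \left(\rho_n^{1/2} n\right)^{\frac{1}{1+\alpha\wedge 1}} \right\rceil$.

(i) Assume that $\rho_n \ge n^{-\frac{2(\alpha\wedge 1)}{1+2(\alpha\wedge 1)}} (\log n)^{2(1+\alpha\wedge 1)}$, and there exists $n_0 \ge 2$ such that $k = \lfloor n/n_0 \rfloor$. Then,
there exists a positive constant $C_1$ depending only on $L$ and $\alpha$ such that the least squares graphon
estimator $\hat{f}$ constructed with this choice of $n_0$ satisfies

$$\mathbb{E}\left[\delta^2(\hat{f}, f_0)\right] \le C_1\left(\rho_n^{\frac{2+\alpha\wedge 1}{1+\alpha\wedge 1}}\, n^{-\frac{2(\alpha\wedge 1)}{1+\alpha\wedge 1}} + \frac{\rho_n \log n}{n} + \frac{\rho_n^2}{n^{\alpha\wedge 1}}\right). \tag{21}$$

(ii) Assume that $r \ge \rho_n \ge C n^{-2}$. Then, there exists a positive constant $C_2$ depending only on $L$ and $\alpha$
such that the restricted least squares graphon estimator $\hat{f}_r$ satisfies

$$\mathbb{E}\left[\delta^2(\hat{f}_r, f_0)\right] \le C_2\left(r^{\frac{2+\alpha\wedge 1}{1+\alpha\wedge 1}}\, n^{-\frac{2(\alpha\wedge 1)}{1+\alpha\wedge 1}} + \frac{r \log n}{n} + \frac{\rho_n^2}{n^{\alpha\wedge 1}}\right). \tag{22}$$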


For the purpose of the discussion, assume that $r \asymp \rho_n$. If $\rho_n \le n^{\alpha-1}\log(n)$, the rate of convergence is
of the order $\rho_n \log(n)/n$, the same as that of the probability matrix estimation risk $\mathbb{E}\left[\|\widehat{\Theta}^r - \Theta_0\|_F^2/n^2\right]$, cf.
Corollary 2.7. Observe that the above condition is always satisfied when $\alpha \ge 1$. If $\rho_n \ge n^{\alpha-1}\log(n)$, the
rate of convergence in (22) for $\alpha < 1$ is of the order $\rho_n^2/n^{\alpha}$ due to the agnostic error. This is slower than
the optimal nonparametric rate $\rho_n^{\frac{2+\alpha}{1+\alpha}} n^{-\frac{2\alpha}{1+\alpha}}$ for probability matrix estimation. We conjecture that this loss
is unavoidable when considering graphon estimation with the $\delta^2(\cdot,\cdot)$ error measure. We also note that the
rates in (22) are faster than those obtained by Wolfe and Olhede [16] for the maximum likelihood estimator.
In some cases, the improvement in the rate is up to $n^{-\alpha/2}$.
4 Proofs

In this section, we will repeatedly use Bernstein’s inequality, which we state here for the reader’s convenience. Let
$X_1, \dots, X_N$ be independent zero-mean random variables. Suppose that $|X_i| \le M$ almost surely, for all $i$.
Then, for any $t > 0$,

$$\mathbb{P}\left\{\sum_{i=1}^{N} X_i \ge \sqrt{2t\sum_{i=1}^{N}\mathbb{E}(X_i^2)} + \frac{2M}{3}\,t\right\} \le e^{-t}. \tag{23}$$

In the proofs of the upper bounds, to shorten the notation, we assume that instead of the $n(n-1)/2$
observations $A'$ we have the symmetrized observation matrix $A$, and thus

$$L(Q,z) = \frac{1}{2}\sum_{(a,b) \in [k] \times [k]} \ \sum_{(i,j) \in z^{-1}(a) \times z^{-1}(b),\ i \ne j} (A_{ij} - Q_{ab})^2.$$

This changes neither the estimators nor the results. The changes are only in the values of constants that
do not explicitly appear in the risk bounds.


4.1 Proof of Proposition 2.1 

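Since $\hat\Theta$ is a least squares estimator, we have

$$
\|\hat\Theta - \Theta_0\|_F^2 \le \|\Theta_0 - \Theta_*\|_F^2 + 2\langle \hat\Theta - \Theta_0, E\rangle + 2\langle \Theta_0 - \Theta_*, E\rangle \tag{24}
$$

where $E = A - \Theta_0$ is the noise matrix. The last summand on the right hand side of (24) has mean zero. So, to prove the proposition it suffices to bound the expectation of $\langle \hat\Theta - \Theta_0, E\rangle$. For any $z \in \mathcal{Z}_{n,k,n_0}$, denote by $\tilde\Theta_z$ the best Frobenius norm approximation of $\Theta_0$ in the collection of matrices

$$
\mathcal{T}_z := \{\Theta :\ \exists\, Q \in \mathbb{R}^{k\times k}_{\rm sym} \text{ such that } \Theta_{ij} = Q_{z(i)z(j)},\ i \ne j, \text{ and } \Theta_{ii} = 0\ \forall i\}.
$$

We have the following decomposition:

$$
\langle \hat\Theta - \Theta_0, E\rangle = \langle \tilde\Theta_{\hat z} - \Theta_0, E\rangle + \langle \hat\Theta - \tilde\Theta_{\hat z}, E\rangle = (I) + (II).
$$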


In this decomposition, (I) is the error due to misclustering and (II) is the error due to the Bernoulli noise. 
We bound each of these errors separately. 

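Control of (I). We apply Bernstein's inequality (23) together with the union bound over all $z \in \mathcal{Z}_{n,k,n_0}$ and we use that $\langle \tilde\Theta_z - \Theta_0, E\rangle = 2\sum_{1\le j<i\le n} (\tilde\Theta_z - \Theta_0)_{ij} E_{ij}$. Since the variance of $E_{ij}$ satisfies $\mathrm{Var}(E_{ij}) \le \|\Theta_0\|_\infty$ while $E_{ij} \in [-1,1]$, and the cardinality of $\mathcal{Z}_{n,k,n_0}$ satisfies $|\mathcal{Z}_{n,k,n_0}| \le k^n$, we obtain

$$
\mathbb{P}\Big[ \langle \tilde\Theta_{\hat z} - \Theta_0, E\rangle \ge 2\|\tilde\Theta_{\hat z} - \Theta_0\|_F \sqrt{\|\Theta_0\|_\infty (n\log(k) + t)} + \frac{4}{3}\|\tilde\Theta_{\hat z} - \Theta_0\|_\infty (n\log(k) + t) \Big] \le e^{-t}
$$

for all $t > 0$. Since the entries of $\tilde\Theta_{\hat z}$ are equal to averaged entries of $\Theta_0$ over blocks, we obtain that $\|\tilde\Theta_{\hat z} - \Theta_0\|_\infty \le \|\Theta_0\|_\infty$. Using this observation, decoupling the term $2\|\tilde\Theta_{\hat z} - \Theta_0\|_F \sqrt{\|\Theta_0\|_\infty (n\log(k) + t)}$ via the elementary inequality $2uv \le u^2 + v^2$ and then integrating with respect to $t$ we obtain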


E 



©o ,E) 




< C'||© 0 || oo nlog(fc) . 


(25) 


Control of (II). The control of the error due to the Bernoulli noise is more involved. We first consider the intersection of $\mathcal{T}_z$ with the unit ball in the Frobenius norm and construct a 1/4-net on this set. Then, using the union bound and Bernstein's inequality, we can write a convenient bound on $\langle \Theta, E\rangle$ for any $\Theta$ from this net. Finally we control (II) using a bound on $\langle \Theta, E\rangle$ on the net and a bound on the supremum norm $\|\hat A_{\hat z} - \tilde\Theta_{\hat z}\|_\infty$. We control the supremum norm $\|\hat A_{\hat z} - \tilde\Theta_{\hat z}\|_\infty$ using the definition of $\hat z$ and Bernstein's inequality.

For any $z \in \mathcal{Z}_{n,k,n_0}$, define the set $\mathcal{T}_{z,1} = \{\Theta \in \mathcal{T}_z : \|\Theta\|_F \le 1\}$ and denote by $\hat A_z$ the best Frobenius norm approximation of $A$ in $\mathcal{T}_z$. Then, $E_z = \hat A_z - \tilde\Theta_z$ is the projection of $E$ onto $\mathcal{T}_z$. Notice that

$$
\frac{\hat A_z - \tilde\Theta_z}{\|\hat A_z - \tilde\Theta_z\|_F} = \frac{E_z}{\|E_z\|_F}
$$

maximizes $\langle \Theta', E\rangle$ over all $\Theta' \in \mathcal{T}_{z,1}$. Denote by $\mathcal{C}_z$ the minimal 1/4-net on $\mathcal{T}_{z,1}$ in the Frobenius norm. To each $V \in \mathcal{C}_z$, we associate

$$
\tilde V \in \arg\min_{\Theta' \in \mathcal{T}_{z,1} \cap B_{\|\cdot\|_F}(V, 1/4)} \|\Theta'\|_\infty
$$

which is a matrix minimizing the entry-wise supremum norm over the Frobenius ball $B_{\|\cdot\|_F}(V, 1/4)$ of radius 1/4 centered at $V$. Finally, define $\tilde{\mathcal{C}}_z := \{\tilde V : V \in \mathcal{C}_z\}$.

A standard bound for covering numbers implies that $\log|\tilde{\mathcal{C}}_z| \le C k^2$, where $|S|$ denotes the cardinality of set $S$. By Bernstein's inequality combined with the union bound, we find that with probability greater than $1 - e^{-t}$,

$$
\langle \Theta, E\rangle \le 2\sqrt{\|\Theta_0\|_\infty (n\log(k) + k^2 + t)} + \frac{4}{3}\|\Theta\|_\infty (n\log(k) + k^2 + t)
$$

simultaneously for all $\Theta \in \tilde{\mathcal{C}}_z$ with any $z \in \mathcal{Z}_{n,k,n_0}$. Here, we have used that $\|\Theta\|_F \le 1$ for all $\Theta \in \tilde{\mathcal{C}}_z$.
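Assume w.l.o.g. that $\hat A_{\hat z} - \tilde\Theta_{\hat z} \ne 0$. By the definition of $\tilde{\mathcal{C}}_{\hat z}$, there exists $\Theta \in \tilde{\mathcal{C}}_{\hat z}$ such that

$$
\left\| \Theta - \frac{\hat A_{\hat z} - \tilde\Theta_{\hat z}}{\|\hat A_{\hat z} - \tilde\Theta_{\hat z}\|_F} \right\|_F \le \frac{1}{2} \quad \text{and} \quad \|\Theta\|_\infty \le \frac{\|\hat A_{\hat z} - \tilde\Theta_{\hat z}\|_\infty}{\|\hat A_{\hat z} - \tilde\Theta_{\hat z}\|_F}.
$$

Note that for this $\Theta$, the matrix $2\Big(\Theta - \frac{\hat A_{\hat z} - \tilde\Theta_{\hat z}}{\|\hat A_{\hat z} - \tilde\Theta_{\hat z}\|_F}\Big)$ belongs to $\mathcal{T}_{\hat z,1}$. Thus,

$$
\left\langle \frac{\hat A_{\hat z} - \tilde\Theta_{\hat z}}{\|\hat A_{\hat z} - \tilde\Theta_{\hat z}\|_F}, E \right\rangle \le \langle \Theta, E\rangle + \frac{1}{2}\max_{\Theta' \in \mathcal{T}_{\hat z,1}} \langle \Theta', E\rangle = \langle \Theta, E\rangle + \frac{1}{2}\left\langle \frac{\hat A_{\hat z} - \tilde\Theta_{\hat z}}{\|\hat A_{\hat z} - \tilde\Theta_{\hat z}\|_F}, E \right\rangle
$$

since $\frac{\hat A_{\hat z} - \tilde\Theta_{\hat z}}{\|\hat A_{\hat z} - \tilde\Theta_{\hat z}\|_F}$ maximizes $\langle \Theta', E\rangle$ over all $\Theta' \in \mathcal{T}_{\hat z,1}$. Using the last two displays we obtain that

$$
\langle \hat A_{\hat z} - \tilde\Theta_{\hat z}, E\rangle \le 4\|\hat A_{\hat z} - \tilde\Theta_{\hat z}\|_F \sqrt{\|\Theta_0\|_\infty (n\log(k) + k^2 + t)} + \frac{8}{3}\|\hat A_{\hat z} - \tilde\Theta_{\hat z}\|_\infty (n\log(k) + k^2 + t) \tag{26}
$$

with probability greater than $1 - e^{-t}$.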

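Next, we are looking for a bound on $\|\hat A_{\hat z} - \tilde\Theta_{\hat z}\|_\infty$. Since $\hat A_z - \tilde\Theta_z = E_z$ we have

$$
(\hat A_z - \tilde\Theta_z)_{ij} = \frac{\sum_{(l',l):\ l' \ne l,\ z(l') = z(i),\ z(l) = z(j)} E_{l'l}}{\big|\{(l',l):\ l' \ne l,\ z(l') = z(i),\ z(l) = z(j)\}\big|}
$$

for all $i \ne j$. Consequently, we have $\|\hat A_{\hat z} - \tilde\Theta_{\hat z}\|_\infty \le \sup_{m = n_0,\dots,n}\ \sup_{s = n_0,\dots,n} X_{ms}$ where

$$
X_{ms} := \sup_{V_1:\ |V_1| = m}\ \sup_{V_2:\ |V_2| = s} \frac{\big|\sum_{(i,j)\in V_1\times V_2:\ i\ne j} E_{ij}\big|}{ms - |V_1 \cap V_2|}.
$$

Since $n_0 \ge 2$, we have $ms - |V_1 \cap V_2| \ge ms - m\wedge s \ge ms/2$ for all $m, s \ge n_0$. Furthermore $|\{V_1 : |V_1| = m\}| \le \binom{n}{m} \le (en/m)^m$. Therefore, Bernstein's inequality combined with the union bound over $V_1, V_2$ leads to

$$
\mathbb{P}\left[ X_{ms} \le C\left( \sqrt{\|\Theta_0\|_\infty \frac{m\log(en/m) + s\log(en/s) + t}{ms}} + \frac{m\log(en/m) + s\log(en/s) + t}{ms} \right) \right] \ge 1 - 2e^{-t}
$$

for any $t > 0$. From a union bound over all integers $m, s \in [n_0, n]$, we conclude that

$$
\mathbb{P}\left[ \|\hat A_{\hat z} - \tilde\Theta_{\hat z}\|_\infty \le C\left( \sqrt{\|\Theta_0\|_\infty \frac{\log(n/n_0) + t}{n_0}} + \frac{\log(n/n_0) + t}{n_0} \right) \right] \ge 1 - 2e^{-t}. \tag{27}
$$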
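
Decoupling (26) by use of the inequality $2uv \le u^2 + v^2$ and combining the result with (27) we obtain

$$
\langle \hat A_{\hat z} - \tilde\Theta_{\hat z}, E\rangle \le \frac{\|\hat A_{\hat z} - \tilde\Theta_{\hat z}\|_F^2}{16} + C\left( \|\Theta_0\|_\infty (n\log(k) + k^2 + t) + \frac{\log(n/n_0)(n\log(k) + k^2) + t(n\log(k) + k^2 + t)}{n_0} \right)
$$

with probability greater than $1 - 3e^{-t}$. Integrating with respect to $t$ leads to

$$
\mathbb{E}\left[ \langle \hat A_{\hat z} - \tilde\Theta_{\hat z}, E\rangle - \frac{\|\hat A_{\hat z} - \tilde\Theta_{\hat z}\|_F^2}{16} \right] \le C\left( \|\Theta_0\|_\infty (n\log(k) + k^2) + \frac{\log(n/n_0)(n\log(k) + k^2)}{n_0} \right). \tag{28}
$$

Now, note that $\hat\Theta = \hat A_{\hat z}$, and hence $\|\tilde\Theta_{\hat z} - \Theta_0\|_F \le \|\hat\Theta - \Theta_0\|_F$ by definition of $\tilde\Theta_{\hat z}$. Thus, $\|\hat A_{\hat z} - \tilde\Theta_{\hat z}\|_F = \|\hat\Theta - \tilde\Theta_{\hat z}\|_F \le 2\|\hat\Theta - \Theta_0\|_F$. These remarks, together with (25) and (28), imply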


E 


(© - 0o,-E) 


< -E| 


© - © 0 ||! + C ||© 0 ||oo(nlog(fc) + k 2 ) + 


log(n/no)(nlog(fc) + k 2 ) 


n 0 


The result of the proposition now follows from the last inequality and (24). 


4.2 Proof of Proposition 2.3 

We first follow the lines of the proof of Proposition 2.1. As there, we have 

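$$
\|\hat\Theta^r - \Theta_0\|_F^2 \le \|\Theta_0 - \Theta_*\|_F^2 + 2\langle \tilde\Theta_{\hat z_r} - \Theta_0, E\rangle + 2\langle \hat\Theta^r - \tilde\Theta_{\hat z_r}, E\rangle + 2\langle \Theta_0 - \Theta_*, E\rangle.
$$

Here, $\mathbb{E}\langle \Theta_0 - \Theta_*, E\rangle = 0$, and analogously to (25),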


E 



©o ,E) 




< C'H ©ollocTT, log(fc). 


(29) 


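It remains to obtain a bound on the expectation of $\langle \hat\Theta^r - \tilde\Theta_{\hat z_r}, E\rangle$, which corresponds to the term (II) in the proof of Proposition 2.1 (the error due to the Bernoulli noise). To do this, we consider the subset $\mathcal{A}$ of matrices in $\mathcal{T}_{\hat z_r}$ such that their supremum norm is bounded by $2r$ and Frobenius norm is bounded by $\|\hat\Theta^r - \tilde\Theta_{\hat z_r}\|_F$. As $\hat\Theta^r - \tilde\Theta_{\hat z_r}$ belongs to this set it is enough to obtain a bound on the supremum of $\langle \Theta, E\rangle$ for $\Theta \in \mathcal{A}$. To control this supremum, we construct a finite subset $\mathcal{C}^*_{\hat z_r}$ that approximates well the set $\mathcal{A}$ both in the Frobenius norm and in the supremum norm.

Consider the set

$$
\mathcal{A} = \{\Theta \in \mathcal{T}_{\hat z_r} :\ \|\Theta\|_\infty \le 2r,\ \|\Theta\|_F \le \|\hat\Theta^r - \tilde\Theta_{\hat z_r}\|_F\}.
$$

Then,

$$
\langle \hat\Theta^r - \tilde\Theta_{\hat z_r}, E\rangle \le \max_{\Theta \in \mathcal{A}} \langle \Theta, E\rangle := \langle T, E\rangle \tag{30}
$$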

where $T$ is a matrix in $\mathcal{A}$ that achieves the maximum. If $\|T\|_F \le 2r$ we have a trivial bound $\langle T, E\rangle \le 2rn$ since all components of $E$ belong to $[-1,1]$. Thus, it suffices to consider the case $\|T\|_F \ge 2r$. In order to bound $\langle T, E\rangle$ in this case, we construct a finite subset of $\mathcal{T}_{\hat z_r}$ that approximates well $T$ both in the Frobenius norm and in the supremum norm.

For each $z \in \mathcal{Z}_{n,k}$, let $\mathcal{C}_z$ be a minimal 1/4-net of $\mathcal{T}_{z,1}$ in the Frobenius norm. Set $\epsilon_0 = r$ and $\epsilon_q = 2^q \epsilon_0$ for any integer $q = 1,\dots,q_{\max}$, where $q_{\max}$ is the smallest integer such that $2rn \le \epsilon_{q_{\max}}$. Clearly, $q_{\max} \le C\log(n)$. For any $V \in \mathcal{C}_z$, any $q = 0,\dots,q_{\max}$, and any matrix $U \in \{-1,0,1\}^{k\times k}$, define a matrix $V^{q,U,z} \in \mathbb{R}^{n\times n}$ with elements $V^{q,U,z}_{ij}$ such that $V^{q,U,z}_{ii} = 0$ for all $i \in [n]$, and for all $i \ne j$,

$$
V^{q,U,z}_{ij} = \mathrm{sign}(V_{ij})\big(|\epsilon_q V_{ij}| \wedge (2r)\big)\big(1 - |U_{z(i)z(j)}|\big) + r\,U_{z(i)z(j)}. \tag{31}
$$

Finally, denote by $\mathcal{C}^*_z$ the set of all such matrices:

$$
\mathcal{C}^*_z := \{V^{q,U,z} :\ V \in \mathcal{C}_z,\ q = 0,\dots,q_{\max},\ U \in \{-1,0,1\}^{k\times k}\}.
$$

For any $z \in \mathcal{Z}_{n,k}$ we have $\log|\mathcal{C}^*_z| \le C(k^2 + \log\log(n))$ while $\log|\mathcal{Z}_{n,k}| \le n\log(k)$. Also, the variance of $E_{ij}$ satisfies $\mathrm{Var}(E_{ij}) \le \|\Theta_0\|_\infty \le r$, and $E_{ij} \in [-1,1]$. Thus, from Bernstein's inequality combined with the union bound we obtain that, with probability greater than $1 - e^{-t}$,

$$
\langle V, E\rangle \le 2\|V\|_F\sqrt{r(k^2 + n\log(k) + t)} + \frac{8}{3}\, r\,(k^2 + n\log(k) + t)
$$

simultaneously for all matrices $V \in \mathcal{C}^*_z$ and all $z \in \mathcal{Z}_{n,k}$. Here, we have used that $\|V\|_\infty \le 2r$ for all $V \in \mathcal{C}^*_z$, cf. (31). It follows that, with probability greater than $1 - e^{-t}$,

$$
\langle V, E\rangle - \frac{1}{50}\|V\|_F^2 \le C r (k^2 + n\log(k) + t) \tag{32}
$$

simultaneously for all matrices $V \in \mathcal{C}^*_z$ and all $z \in \mathcal{Z}_{n,k}$.

We now use the following lemma proved in Subsection 4.10.

Lemma 4.1. If $\|T\|_F \ge 2r$ there exists a matrix $V$ in $\mathcal{C}^*_{\hat z_r}$, such that

• $\|T - V\|_F \le \|T\|_F/4$,

• $\|T - V\|_\infty \le r$.

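Thus, on the event $\|T\|_F \ge r$, as a consequence of Lemma 4.1 we obtain that $2(T - V) \in \mathcal{A}$. This and the definition of $T$ imply $\langle T - V, E\rangle \le \langle T, E\rangle/2$, so that $\langle T, E\rangle \le 2\langle V, E\rangle$, and thus

$$
\langle \hat\Theta^r - \tilde\Theta_{\hat z_r}, E\rangle/2 \le \langle V, E\rangle.
$$

Furthermore, by Lemma 4.1

$$
\|V\|_F \le 5\|T\|_F/4 \le 5\|\hat\Theta^r - \tilde\Theta_{\hat z_r}\|_F/4.
$$

These remarks (recall that they hold on the event $\|T\|_F \ge r$), and (32) yield

$$
\mathbb{P}\left[ \left( \langle \hat\Theta^r - \tilde\Theta_{\hat z_r}, E\rangle - \frac{\|\hat\Theta^r - \tilde\Theta_{\hat z_r}\|_F^2}{16} \right) 1_{\{\|T\|_F \ge r\}} \ge C r (k^2 + n\log(k) + t) \right] \le e^{-t}
$$

for all $t > 0$. Integrating with respect to $t$ and using that $\langle \hat\Theta^r - \tilde\Theta_{\hat z_r}, E\rangle \le rn$ for $\|T\|_F \le r$ we obtain

$$
\mathbb{E}_{\Theta_0}\left[ \langle \hat\Theta^r - \tilde\Theta_{\hat z_r}, E\rangle - \frac{\|\hat\Theta^r - \tilde\Theta_{\hat z_r}\|_F^2}{16} \right] \le C r (n\log(k) + k^2).
$$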


In view of this inequality and (29) the proof is now finished by the same argument as in Proposition 2.1. 


4.3 Proof of Proposition 2.4 

The proof follows the lines of Theorem 2.2 in [10]. So, for brevity, we only outline the differences from that proof. The remaining details can be found in [10]. First, note that to prove the theorem it suffices to obtain separately the lower bounds of the order $\rho_n (k/n)^2 \wedge \rho_n^2$ and of the order $\rho_n \log(k)/n \wedge \rho_n^2$. Next, note that the Kullback-Leibler divergence $\mathcal{K}(\rho_n p, \rho_n q)$ between two Bernoulli distributions with parameters $\rho_n p$ and $\rho_n q$ such that $1/4 \le p, q \le 3/4$ satisfies

$$
\begin{aligned}
\mathcal{K}(\rho_n p, \rho_n q) &= \rho_n q \log\Big(\frac{q}{p}\Big) + (1 - \rho_n q)\log\Big(\frac{1 - \rho_n q}{1 - \rho_n p}\Big) \\
&\le \rho_n \frac{q(q-p)}{p} + \rho_n \frac{1 - \rho_n q}{1 - \rho_n p}(p - q) = \rho_n \frac{(q-p)^2}{p(1 - \rho_n p)} \le \frac{16}{3}\rho_n (q-p)^2.
\end{aligned} \tag{33}
$$

The main difference from the proof in [10] is that now the matrices $\Theta_0$ defining the probability distributions in the Fano lemma depend on $\rho_n$. Namely, to prove the $\rho_n (k/n)^2 \wedge \rho_n^2$ bound we consider matrices of connection probabilities with elements $\frac{\rho_n}{2} + c_1\sqrt{\rho_n}\big(\frac{k}{n} \wedge \sqrt{\rho_n}\big)\omega_{ab}$ with suitably chosen $\omega_{ab} \in \{0,1\}$ and $c_1 > 0$ small enough (for $\rho_n = 1$ this coincides with the definition of elements in [10]). Then the squared Frobenius distance between matrices $\Theta_0$ in the corresponding set is of the order $n^2(\rho_n (k/n)^2 \wedge \rho_n^2)$, which leads to the desired rate, whereas in view of (33) the Kullback-Leibler divergences between the probability measures in this set are bounded by $C n^2 (\rho_n \wedge (k/n)^2) \le C k^2$. Thus, the lower bound of the order $\rho_n (k/n)^2 \wedge \rho_n^2$ follows.

The argument used to prove the lower bound of the order $\rho_n \log(k)/n \wedge \rho_n^2$ is quite analogous. To this end, we modify the corresponding construction of [10] only in that we take the connection probabilities of the form $\frac{\rho_n}{2} + c_2\sqrt{\rho_n}\Big(\sqrt{\frac{\log(k)}{n}} \wedge \sqrt{\rho_n}\Big)\omega_a$ with suitably chosen $\omega_a \in \{0,1\}$ and $c_2 > 0$ small enough (for $\rho_n = 1$ this coincides with the definition of probabilities $B_a$ in the proof of [10]).
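
The Kullback-Leibler bound (33) can be checked numerically on a grid. The sketch below assumes the reading of (33) as the divergence of a Bernoulli distribution with parameter $\rho_n q$ from one with parameter $\rho_n p$, bounded by $\frac{16}{3}\rho_n (q-p)^2$ for $1/4 \le p, q \le 3/4$. The function names and the grid values are illustrative.

```python
import math


def kl_bernoulli(a, b):
    """KL divergence of Bernoulli(a) from Bernoulli(b), for 0 < a, b < 1."""
    return a * math.log(a / b) + (1.0 - a) * math.log((1.0 - a) / (1.0 - b))


def bound_33_holds(rho, p, q, tol=1e-12):
    """Check rho*q*log(q/p) + (1-rho*q)*log((1-rho*q)/(1-rho*p)) <= (16/3)*rho*(q-p)^2."""
    lhs = kl_bernoulli(rho * q, rho * p)
    rhs = 16.0 / 3.0 * rho * (q - p) ** 2
    return lhs <= rhs + tol


# Grid over sparsity levels and over p, q in [1/4, 3/4] (illustrative values).
grid = [0.25 + 0.05 * i for i in range(11)]
all_ok = all(bound_33_holds(rho, p, q)
             for rho in (0.001, 0.01, 0.1, 0.5, 1.0)
             for p in grid for q in grid)
print(all_ok)
```

The factor $\rho_n$ on the right hand side is what makes the divergence scale with the sparsity level, which is how $\rho_n$ enters the lower bound.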


4.4 Proof of Proposition 2.5 

We prove that there exists a random matrix $\bar\Theta$ measurable with respect to $\xi_1,\dots,\xi_n$ and with values in $\mathcal{T}_{n_0}[k]$ satisfying

$$
\mathbb{E}\left( \frac{1}{n^2}\|\Theta_0 - \bar\Theta\|_F^2 \right) \le C M^2 \rho_n^2 \left(\frac{1}{k^2}\right)^{\alpha\wedge 1}. \tag{34}
$$

Obviously, this implies (9). To obtain such $\bar\Theta$ we construct a balanced partition $z^*$ where the first $k-1$ classes contain exactly $n_0$ elements and the last class contains $n - n_0(k-1)$ elements. Then we construct $\bar\Theta$ using block averages on the blocks given by $z^*$.
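
The balanced partition described above can be sketched as follows: sort the latent positions, assign consecutive runs of $n_0$ order statistics to the first $k-1$ classes, and put the remaining points in the last class, with $k = \lfloor n/n_0 \rfloor$. The function name and the sample values are hypothetical, and classes are 0-indexed here.

```python
def balanced_partition(xi, n0):
    """Assign node i to a class according to the rank of xi[i].

    Ranks 0..n0-1 go to class 0, ranks n0..2*n0-1 to class 1, and so on;
    the last class (index k-1) also absorbs the n - n0*k leftover nodes.
    """
    n = len(xi)
    k = n // n0
    order = sorted(range(n), key=lambda i: xi[i])   # indices sorted by latent position
    z = [0] * n
    for rank, i in enumerate(order):
        z[i] = min(rank // n0, k - 1)
    return z, k


# Illustrative latent positions (hypothetical values).
xi = [0.9, 0.1, 0.5, 0.3, 0.7, 0.2, 0.8]
z_star, k = balanced_partition(xi, n0=3)
sizes = [z_star.count(a) for a in range(k)]
print(z_star, sizes)
```

Because each class collects neighboring order statistics, latent positions within a class are close to each other, which is what the smoothness condition on the graphon exploits in the proof.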

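Let $n = n_0 k + r$ where $r$ is a remainder term between 0 and $n_0 - 1$. Define $z^* : [n] \to [k]$ by

$$
(z^*)^{-1}(a) = \{i \in [n] :\ \xi_i = \xi_{(j)} \text{ for some } j \in [(a-1)n_0 + 1,\ a n_0]\}
$$

for each $a \in \{1,\dots,k-1\}$ and

$$
(z^*)^{-1}(k) = \{i \in [n] :\ \xi_i = \xi_{(j)} \text{ for some } j \in [(k-1)n_0 + 1,\ n]\}
$$

where $\xi_{(j)}$ denotes the $j$th order statistic. Note that with this partition the first $k-1$ classes contain $n_0$ elements and the last class contains $n_0 + r$ elements. We define

$$
n^*_{ab} = \begin{cases}
n_0^2, & \text{if } a \ne b,\ a \ne k,\ b \ne k, \\
(n_0 + r)n_0, & \text{if } a = k \text{ or } b = k \text{ and } a \ne b, \\
(n_0 - 1)n_0, & \text{if } a = b \text{ and } a \ne k, \\
(n_0 + r)(n_0 + r - 1), & \text{if } a = b = k.
\end{cases}
$$

Using the partition $z^*$, we define the block average

$$
Q^*_{ab} = \frac{1}{n^*_{ab}} \sum_{i \in (z^*)^{-1}(a),\ j \in (z^*)^{-1}(b),\ i \ne j} \rho_n W_0(\xi_i, \xi_j).
$$

Finally, the approximation $\bar\Theta$ of $\Theta_0$ is defined as a symmetric matrix with entries $\bar\Theta_{ij} = Q^*_{z^*(i)z^*(j)}$ for all $i \ne j$ and $\bar\Theta_{ii} = 0$ for all $i$. We have

$$
\begin{aligned}
\mathbb{E}\left( \frac{1}{n^2}\|\Theta_0 - \bar\Theta\|_F^2 \right) &= \frac{1}{n^2} \sum_{a \in [k],\, b \in [k]} \mathbb{E} \sum_{i \in (z^*)^{-1}(a),\ j \in (z^*)^{-1}(b),\ i \ne j} \big((\Theta_0)_{ij} - Q^*_{ab}\big)^2 \\
&= \frac{\rho_n^2}{n^2} \sum_{a \in [k],\, b \in [k]} \mathbb{E} \sum_{i \in (z^*)^{-1}(a),\ j \in (z^*)^{-1}(b),\ i \ne j} \left( W_0(\xi_i, \xi_j) - \frac{\sum_{u \in (z^*)^{-1}(a),\ v \in (z^*)^{-1}(b),\ u \ne v} W_0(\xi_u, \xi_v)}{n^*_{ab}} \right)^2 .
\end{aligned}
$$

Define $J_a = [(a-1)n_0 + 1,\ a n_0]$ if $a < k$ and $J_k = [(k-1)n_0 + 1,\ n]$. By definition of $z^*$ we have

$$
\begin{aligned}
\mathbb{E}\left( \frac{1}{n^2}\|\Theta_0 - \bar\Theta\|_F^2 \right) &= \frac{\rho_n^2}{n^2} \sum_{a \in [k],\, b \in [k]}\ \sum_{i \in J_a,\ j \in J_b,\ i \ne j} \mathbb{E}\left( W_0(\xi_{(i)}, \xi_{(j)}) - \frac{\sum_{u \in J_a,\ v \in J_b,\ u \ne v} W_0(\xi_{(u)}, \xi_{(v)})}{n^*_{ab}} \right)^2 \\
&\le \frac{\rho_n^2}{n^2} \sum_{a \in [k],\, b \in [k]}\ \sum_{i \in J_a,\ j \in J_b,\ i \ne j} \frac{1}{n^*_{ab}} \sum_{u \in J_a,\ v \in J_b,\ u \ne v} \mathbb{E}\big( W_0(\xi_{(i)}, \xi_{(j)}) - W_0(\xi_{(u)}, \xi_{(v)}) \big)^2 .
\end{aligned} \tag{35}
$$

Using (8) and Jensen's inequality we obtain

$$
\begin{aligned}
\mathbb{E}\big( W_0(\xi_{(i)}, \xi_{(j)}) - W_0(\xi_{(u)}, \xi_{(v)}) \big)^2 &\le M^2\, \mathbb{E}\Big[ \big( |\xi_{(i)} - \xi_{(u)}|^{\alpha'} + |\xi_{(j)} - \xi_{(v)}|^{\alpha'} \big)^2 \Big] \\
&\le 2M^2 \Big( \big(\mathbb{E}|\xi_{(i)} - \xi_{(u)}|^2\big)^{\alpha'} + \big(\mathbb{E}|\xi_{(j)} - \xi_{(v)}|^2\big)^{\alpha'} \Big)
\end{aligned}
$$

where we set for brevity $\alpha' = \alpha \wedge 1$. Note that by definition of $z^*$ we have $|i - u| < 2n_0$ and $|j - v| < 2n_0$. Therefore, application of Lemma 4.10 (see Section 4.10) leads to

$$
\mathbb{E}\big( W_0(\xi_{(i)}, \xi_{(j)}) - W_0(\xi_{(u)}, \xi_{(v)}) \big)^2 \le C(n_0/n)^{2\alpha'} \le C(1/k)^{2\alpha'}
$$

where we have used that $k = \lfloor n/n_0 \rfloor$. Plugging this bound into (35) proves the proposition.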


4.5 Proof of Proposition 3.2 

For any $W_0 \in \mathcal{W}[k]$, we first construct an ordered graphon $W'$ isomorphic to $W_0$ and we set $f' = \rho_n W'$. Then we construct an ordered empirical graphon $\hat f'$ isomorphic to $\tilde f_{\Theta_0}$ and we estimate the $\delta(\cdot,\cdot)$-distance between these two ordered versions $f'$ and $\hat f'$.

Consider the matrix $\Theta_0'$ with entries $(\Theta_0')_{ij} = \rho_n W_0(\xi_i, \xi_j)$ for all $i, j$. As opposed to $\Theta_0$, the diagonal entries of $\Theta_0'$ are not constrained to be null. By the triangle inequality, we get

$$
\mathbb{E}\big[\delta^2(\tilde f_{\Theta_0}, f_0)\big] \le 2\,\mathbb{E}\big[\delta^2(\tilde f_{\Theta_0}, \tilde f_{\Theta_0'})\big] + 2\,\mathbb{E}\big[\delta^2(\tilde f_{\Theta_0'}, f_0)\big]. \tag{36}
$$

Since the entries of $\Theta_0$ coincide with those of $\Theta_0'$ outside the diagonal, the difference $\tilde f_{\Theta_0} - \tilde f_{\Theta_0'}$ is null outside of a set of measure $1/n$. Also, the entries of $\Theta_0'$ are smaller than $\rho_n$. It follows that $\mathbb{E}[\delta^2(\tilde f_{\Theta_0}, \tilde f_{\Theta_0'})] \le \rho_n^2/n$. Hence, it suffices to prove that

$$
\mathbb{E}\big[\delta^2(\tilde f_{\Theta_0'}, f_0)\big] \le C \rho_n^2 \sqrt{k/n}.
$$

We prove this inequality by induction on $k$. The result is trivial for $k = 1$ as $\delta^2(\tilde f_{\Theta_0'}, f_0) = 0$. Fix some $k > 1$ and assume that the result is valid for $\mathcal{W}[k-1]$. Consider any $W_0 \in \mathcal{W}[k]$ and let $Q \in \mathbb{R}^{k\times k}_{\rm sym}$ and $\phi : [0,1] \to [k]$ be associated to $W_0$ as in definition (16). We assume w.l.o.g. that all the rows of $Q$ are distinct and that $\lambda_a := \lambda(\phi^{-1}(a))$ is positive for all $a \in [k]$, since otherwise $W_0$ belongs to $\mathcal{W}[k-1]$. For any $b \in [k]$, define the cumulative distribution function

$$
F_\phi(b) = \sum_{a=1}^{b} \lambda_a
$$

and set $F_\phi(0) = 0$. For any $(a,b) \in [k]\times[k]$ define $\Pi_{ab}(\phi) = [F_\phi(a-1), F_\phi(a)) \times [F_\phi(b-1), F_\phi(b))$, and let $1_A(\cdot)$ denote the indicator function of a set $A$. Finally, we consider the ordered graphon

$$
W'(x,y) = \sum_{a=1}^{k}\sum_{b=1}^{k} Q_{ab}\, 1_{\Pi_{ab}(\phi)}(x,y).
$$

Obviously, $f' = \rho_n W'$ is weakly isomorphic to $f_0 = \rho_n W_0$. Let

$$
\hat\lambda_a = \frac{1}{n}\sum_{i=1}^{n} 1_{\{\xi_i \in \phi^{-1}(a)\}}
$$

be the (unobserved) empirical frequency of group $a$. Here, $\xi_1,\dots,\xi_n$ are the i.i.d. uniformly distributed random variables in the graphon model (2).

Note that the relations $\sum_{a=1}^{k} \hat\lambda_a = \sum_{a=1}^{k} \lambda_a = 1$ imply

$$
\sum_{a:\ \lambda_a > \hat\lambda_a} (\lambda_a - \hat\lambda_a) = \sum_{a:\ \hat\lambda_a > \lambda_a} (\hat\lambda_a - \lambda_a). \tag{37}
$$

Consider a function $\psi : [0,1] \to [k]$ such that:

(i) $\psi(x) = a$ for all $a \in [k]$ and $x \in [F_\phi(a-1),\ F_\phi(a-1) + \hat\lambda_a \wedge \lambda_a)$,

(ii) $\lambda(\psi^{-1}(a)) = \hat\lambda_a$ for all $a \in [k]$.

Such a function $\psi$ exists. Indeed, for each $a$ such that $\lambda_a \ge \hat\lambda_a$, conditions (i) and (ii) are trivially satisfied if we take $\psi^{-1}(a) = [F_\phi(a-1),\ F_\phi(a-1) + \hat\lambda_a)$, and there is an interval of Lebesgue measure $\lambda_a - \hat\lambda_a$ left non-assigned. Summing over all such $a$, we see that there is a union of intervals with Lebesgue measure $m_+ := \sum_{a:\ \lambda_a > \hat\lambda_a} (\lambda_a - \hat\lambda_a)$ non-assigned. On the other hand, for $a$ such that $\lambda_a < \hat\lambda_a$, we must have $\psi(x) = a$ for $x \in [F_\phi(a-1),\ F_\phi(a-1) + \lambda_a)$ to satisfy (i), while to meet condition (ii) we need additionally to assign $\psi(x) = a$ for $x$ on a set of Lebesgue measure $\hat\lambda_a - \lambda_a$. Summing over all such $a$, we need additionally to find a set of Lebesgue measure $m_- := \sum_{a:\ \hat\lambda_a > \lambda_a} (\hat\lambda_a - \lambda_a)$ to make such assignments. But this set is readily available as a union of non-assigned intervals for all $a$ such that $\lambda_a > \hat\lambda_a$ since $m_+ = m_-$ by virtue of (37).

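Finally define the graphon $\hat f'(x,y) = \rho_n Q_{\psi(x),\psi(y)}$. Notice that in view of (ii) $\hat f'$ is weakly isomorphic to the empirical graphon $\tilde f_{\Theta_0'}$. Since $\delta(\cdot,\cdot)$ is a metric on the quotient space $\widetilde{\mathcal{W}}$,

$$
\delta^2(\tilde f_{\Theta_0'}, f_0) = \delta^2(\hat f', f') \le \int_{[0,1]^2} |\hat f'(x,y) - f'(x,y)|^2\,dx\,dy \le \rho_n^2 \int_{[0,1]^2} 1_{\{\hat f'(x,y) \ne f'(x,y)\}}\,dx\,dy.
$$

The two functions $f'(x,y)$ and $\hat f'(x,y)$ are equal except possibly the case when either $x$ or $y$ belongs to one of the intervals $[F_\phi(a-1) + \hat\lambda_a \wedge \lambda_a,\ F_\phi(a-1) + \lambda_a)$ for $a \in [k]$. Hence, the Lebesgue measure of the set $\{(x,y) : f'(x,y) \ne \hat f'(x,y)\}$ is not greater than $2m_+ = m_+ + m_- = \sum_{a=1}^{k} |\lambda_a - \hat\lambda_a|$. Thus,

$$
\delta^2(\tilde f_{\Theta_0'}, f_0) \le \rho_n^2 \sum_{a=1}^{k} |\lambda_a - \hat\lambda_a|.
$$

Since $\xi_1,\dots,\xi_n$ are i.i.d. uniformly distributed random variables, $n\hat\lambda_a$ has a binomial distribution with parameters $(n, \lambda_a)$. By the Cauchy-Schwarz inequality we get $\mathbb{E}[|\lambda_a - \hat\lambda_a|] \le \sqrt{\lambda_a(1-\lambda_a)/n}$. Applying again the Cauchy-Schwarz inequality, we conclude that

$$
\mathbb{E}\big[\delta^2(\tilde f_{\Theta_0'}, f_0)\big] \le \rho_n^2 \sum_{a=1}^{k} \sqrt{\frac{\lambda_a}{n}} \le \rho_n^2 \sqrt{\frac{k}{n}}.
$$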


4.6 Proof of Proposition 3.5 

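Arguing as in the proof of Proposition 3.2, we have

$$
\mathbb{E}\big[\delta^2(\tilde f_{\Theta_0}, f_0)\big] \le \frac{2\rho_n^2}{n} + 2\,\mathbb{E}\big[\delta^2(\tilde f_{\Theta_0'}, f_0)\big]
$$

where we recall that $\Theta_0'$ is defined by $(\Theta_0')_{ij} = \rho_n W_0(\xi_i, \xi_j)$ for all $i, j$. Hence, it suffices to prove that

$$
\mathbb{E}\big[\delta^2(\tilde f_{\Theta_0'}, f_0)\big] \le C \frac{\rho_n^2}{n^{\alpha\wedge 1}}.
$$

We have

$$
\delta^2(\tilde f_{\Theta_0'}, f_0) = \inf_{\tau \in \mathcal{M}} \sum_{i,j} \int_{(j-1)/n}^{j/n}\int_{(i-1)/n}^{i/n} |f_0(\tau(x), \tau(y)) - (\Theta_0')_{ij}|^2\,dx\,dy.
$$

The infimum over all measure-preserving bijections is smaller than the minimum over the subclass of measure-preserving bijections $\tau$ satisfying the following property

$$
\int_{(j-1)/n}^{j/n}\int_{(i-1)/n}^{i/n} W_0(\tau(x), \tau(y))\,dx\,dy = \int_{(\sigma(j)-1)/n}^{\sigma(j)/n}\int_{(\sigma(i)-1)/n}^{\sigma(i)/n} W_0(x,y)\,dx\,dy
$$

for some permutation $\sigma = (\sigma(1),\dots,\sigma(n))$ of $\{1,\dots,n\}$. Such $\tau$ correspond to permutations of intervals $[(i-1)/n, i/n]$ in accordance with $\sigma$. For $(x,y) \in [(\sigma(i)-1)/n, \sigma(i)/n] \times [(\sigma(j)-1)/n, \sigma(j)/n]$ we use the bound

$$
\begin{aligned}
|\rho_n W_0(x,y) - \rho_n W_0(\xi_i, \xi_j)| \le{}& \Big|\rho_n W_0(x,y) - \rho_n W_0\Big(\tfrac{\sigma(i)}{n+1}, \tfrac{\sigma(j)}{n+1}\Big)\Big| \\
&+ \Big|\rho_n W_0\Big(\tfrac{\sigma(i)}{n+1}, \tfrac{\sigma(j)}{n+1}\Big) - \rho_n W_0(\xi_{(\sigma(i))}, \xi_{(\sigma(j))})\Big| \\
&+ \big|\rho_n W_0(\xi_{(\sigma(i))}, \xi_{(\sigma(j))}) - \rho_n W_0(\xi_i, \xi_j)\big|
\end{aligned} \tag{38}
$$

where $\xi_{(m)}$ denotes the $m$th smallest element of the set $\{\xi_1,\dots,\xi_n\}$. We choose a random permutation $\sigma = (\sigma(1),\dots,\sigma(n))$ such that $\xi_{\sigma^{-1}(1)} \le \xi_{\sigma^{-1}(2)} \le \dots \le \xi_{\sigma^{-1}(n)}$. With this choice of $\sigma$, we have that $(\xi_{(\sigma(i))}, \xi_{(\sigma(j))}) = (\xi_i, \xi_j)$ and $|\rho_n W_0(\xi_{(\sigma(i))}, \xi_{(\sigma(j))}) - \rho_n W_0(\xi_i, \xi_j)| = 0$ almost surely.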

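For the first summand in (38), as $W_0(\cdot,\cdot)$ satisfies (8) and

$$
\Big(\frac{\sigma(i)}{n+1}, \frac{\sigma(j)}{n+1}\Big) \in [(\sigma(i)-1)/n,\ \sigma(i)/n] \times [(\sigma(j)-1)/n,\ \sigma(j)/n]
$$

we get

$$
\Big| W_0(x,y) - W_0\Big(\frac{\sigma(i)}{n+1}, \frac{\sigma(j)}{n+1}\Big) \Big| \le 2Ln^{-\alpha'} \tag{39}
$$

where we set for brevity $\alpha' = \alpha \wedge 1$. To evaluate the contribution of the second summand on the right hand side of (38) we use that

$$
\Big| W_0\Big(\frac{\sigma(i)}{n+1}, \frac{\sigma(j)}{n+1}\Big) - W_0(\xi_{(\sigma(i))}, \xi_{(\sigma(j))}) \Big| \le M\left( \Big|\frac{\sigma(i)}{n+1} - \xi_{(\sigma(i))}\Big|^{\alpha'} + \Big|\frac{\sigma(j)}{n+1} - \xi_{(\sigma(j))}\Big|^{\alpha'} \right). \tag{40}
$$

Squaring, integrating and taking expectation we obtain

$$
\begin{aligned}
\mathbb{E} \sum_{i,j} \int_{(\sigma(j)-1)/n}^{\sigma(j)/n}\int_{(\sigma(i)-1)/n}^{\sigma(i)/n} \Big|\frac{\sigma(i)}{n+1} - \xi_{(\sigma(i))}\Big|^{2\alpha'}\,dx\,dy &= \frac{1}{n}\,\mathbb{E}\sum_{m=1}^{n} \Big|\frac{m}{n+1} - \xi_{(m)}\Big|^{2\alpha'} \\
&\le \max_{m} \mathbb{E}\Big|\frac{m}{n+1} - \xi_{(m)}\Big|^{2\alpha'} \le \max_{m} \big(\mathrm{Var}(\xi_{(m)})\big)^{\alpha'} \le Cn^{-\alpha'}
\end{aligned} \tag{41}
$$

where we have used the relations $\mathbb{E}(\xi_{(m)}) = \frac{m}{n+1}$, $\mathrm{Var}(\xi_{(m)}) \le C/n$, and Jensen's inequality. The contribution corresponding to the second summand on the right hand side of (40) is evaluated analogously. Combining (39) - (41) with (38) we get

$$
\mathbb{E}\big[\delta^2(\tilde f_{\Theta_0'}, f_0)\big] \le C \rho_n^2\, n^{-\alpha'}.
$$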
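
The two facts about uniform order statistics used in (41) follow from the Beta distribution of $\xi_{(m)}$: the $m$th smallest of $n$ i.i.d. uniform variables has a Beta$(m, n+1-m)$ law, with mean $m/(n+1)$ and variance $m(n+1-m)/((n+1)^2(n+2))$. The sketch below evaluates these closed forms and checks that the variance is at most $1/(4(n+2))$, which is one admissible value of the bound $C/n$. The function names are illustrative.

```python
def order_stat_mean(m, n):
    """Mean of the m-th smallest of n i.i.d. Uniform(0,1) variables."""
    return m / (n + 1)


def order_stat_var(m, n):
    """Variance of the m-th smallest of n i.i.d. Uniform(0,1) variables (Beta(m, n+1-m))."""
    return m * (n + 1 - m) / ((n + 1) ** 2 * (n + 2))


def max_order_stat_var(n):
    """Largest variance over m = 1, ..., n; it is attained near the median."""
    return max(order_stat_var(m, n) for m in range(1, n + 1))


for n in (1, 5, 50, 500):
    print(n, max_order_stat_var(n), 1.0 / (4 * (n + 2)))
```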

4.7 Proof of Corollary 2.7


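To prove the first part of the corollary, notice that its assumptions imply that the partition is balanced and $\rho_n \ge k\log(k)/n$. Thus, similarly to Corollary 2.2, we obtain from Proposition 2.1 that

$$
\mathbb{E}\left[ \frac{1}{n^2}\|\hat\Theta - \Theta_0\|_F^2 \right] \le \frac{C}{n^2}\|\Theta_0 - \Theta_{*,n_0}\|_F^2 + C\rho_n\left( \frac{\log(k)}{n} + \frac{k^2}{n^2} \right).
$$

Using Proposition 2.5 to bound the expectation of the first summand on the right hand side we get

$$
\mathbb{E}\left[ \frac{1}{n^2}\|\hat\Theta - \Theta_0\|_F^2 \right] \le C\left( \frac{\rho_n^2}{k^{2(\alpha\wedge 1)}} + \rho_n\left( \frac{\log(k)}{n} + \frac{k^2}{n^2} \right) \right). \tag{42}
$$

Now (11) follows from (42) by taking $k = \big\lceil (\rho_n n^2)^{\frac{1}{2(1+\alpha\wedge 1)}} \big\rceil$. Bound (12) for the restricted least squares estimator follows from Propositions 2.3 and 2.6.
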

4.8 Proof of Corollary 3.3 

To prove part (i) of the corollary, we control the size of each block of $\Theta_0$. For any $a$ in $[k]$, the number $N_a$ of nodes belonging to block $a$ is a binomial random variable with parameters $(n, 1/k)$ since the graphon $W_0$ is balanced. By Bernstein's inequality,

$$
\mathbb{P}\Big[ N_a - \frac{n}{k} \le -t \Big] \le \exp\left[ -\frac{t^2/2}{n/k + t/3} \right].
$$

Taking $t = n/(2k)$ in the above inequality, we obtain

$$
\mathbb{P}\Big[ N_a \le \frac{n}{2k} \Big] \le \exp[-Cn/k].
$$

This inequality and the union bound over all $a \in [k]$ imply that the size of all blocks of $\Theta_0$ is greater than $n_0$ with probability greater than $1 - k\exp(-Cn/k)$. Together with Propositions 2.1 and 3.2, this yields


E 


<5 2 




k exp(— Cn/k) , 


where the last summand is negligible. The second part of the corollary is a straightforward consequence of 
Propositions 2.1 and 2.3. 


4.9 Proof of Proposition 3.4 

It is enough to prove separately the following three minimax lower bounds. 


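$$
\inf_{\hat f}\ \sup_{W_0 \in \mathcal{W}[k]} \mathbb{E}_{W_0}\big[\delta^2(\hat f, \rho_n W_0)\big] \ge C\rho_n^2 \sqrt{\frac{k-1}{n}}, \tag{43}
$$

$$
\inf_{\hat f}\ \sup_{W_0 \in \mathcal{W}[k]} \mathbb{E}_{W_0}\big[\delta^2(\hat f, \rho_n W_0)\big] \ge C\left( \rho_n \frac{k^2}{n^2} \wedge \rho_n^2 \right), \tag{44}
$$

$$
\inf_{\hat f}\ \sup_{W_0 \in \mathcal{W}[2]} \mathbb{E}_{W_0}\big[\delta^2(\hat f, \rho_n W_0)\big] \ge C\left( \frac{\rho_n}{n} \wedge \rho_n^2 \right). \tag{45}
$$
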
4.9.1 Proof of (43) 

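Without loss of generality, it suffices to prove (43) for $k = 2$ and for all $k = 16\bar k$ with integer $\bar k$ large enough. Indeed, for any $k > 2$, there exists $k' \le k$ such that $k - k' \le 15$ and $k'$ is either a multiple of 16 or equal to 2, so that

$$
\inf_{\hat f}\ \sup_{W_0 \in \mathcal{W}[k]} \mathbb{E}_{W_0}\big[\delta^2(\hat f, f_0)\big] \ge \inf_{\hat f}\ \sup_{W_0 \in \mathcal{W}[k']} \mathbb{E}_{W_0}\big[\delta^2(\hat f, f_0)\big] \ge C\rho_n^2\sqrt{\frac{k'-1}{n}} \ge \frac{C}{4}\rho_n^2\sqrt{\frac{k-1}{n}}.
$$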

We first consider $k$ which is a multiple of 16. The case $k = 2$ is sketched afterwards.

The proof follows the general scheme of reduction to testing of a finite number of hypotheses (cf., e.g., [15]). The main difficulty is to obtain the necessary lower bound for the distance $\delta(\cdot,\cdot)$ between the graphons generating the hypotheses since this distance is defined as a minimum over all measure-preserving bijections. We start by constructing a matrix $B$ from which the step-function graphons will be derived.

Lemma 4.2. Fix $\eta_0 = 1/16$ and assume that $k$ is a multiple of 16 and is greater than some constant. There exists a $k \times k$ symmetric $\{-1,1\}$ matrix $B$ satisfying the following two properties.

• For any $(a,b) \in [k]\times[k]$ with $a \ne b$, the inner product between the two columns $\langle B_{a,\cdot}, B_{b,\cdot}\rangle$ satisfies

$$
|\langle B_{a,\cdot}, B_{b,\cdot}\rangle| \le k/4. \tag{46}
$$

• For any two subsets $X$ and $Y$ of $[k]$ satisfying $|X| = |Y| = \eta_0 k$ and $X \cap Y = \emptyset$ and any labelings $\pi_1 : [\eta_0 k] \to X$ and $\pi_2 : [\eta_0 k] \to Y$, we have

$$
\sum_{a,b=1}^{\eta_0 k} \big[ B_{\pi_1(a),\pi_1(b)} - B_{\pi_2(a),\pi_2(b)} \big]^2 \ge \eta_0^2 k^2/2. \tag{47}
$$


Note that any Hadamard matrix satisfies condition (46) since its columns are orthogonal. Unfortunately, 
the second condition (47) seems difficult to check for such matrices. This is why we adopt a probabilistic 
approach in the proof of this lemma showing that, with positive probability, a symmetric matrix with 
independent Rademacher entries satisfies the above conditions. 
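
To make the probabilistic construction concrete, the sketch below draws a symmetric matrix with independent Rademacher entries on and above the diagonal and evaluates the largest absolute inner product between two distinct columns, which is the quantity controlled in (46). Only property (46) is checked here; property (47) involves all pairs of disjoint subsets and labelings and is not enumerated. The function names are hypothetical, and the Sylvester-Hadamard matrix is included only as a deterministic sanity check.

```python
import random


def random_symmetric_sign_matrix(k, rng):
    """Symmetric k x k matrix with i.i.d. +/-1 entries on and above the diagonal."""
    B = [[0] * k for _ in range(k)]
    for a in range(k):
        for b in range(a, k):
            B[a][b] = B[b][a] = rng.choice((-1, 1))
    return B


def max_offdiag_inner_product(B):
    """max over a != b of |<B_a, B_b>| (rows and columns coincide by symmetry)."""
    k = len(B)
    return max(abs(sum(B[a][c] * B[b][c] for c in range(k)))
               for a in range(k) for b in range(a + 1, k))


def satisfies_46(B):
    """Property (46): every off-diagonal inner product is at most k/4 in absolute value."""
    return max_offdiag_inner_product(B) <= len(B) / 4


# Sanity check on a symmetric Hadamard matrix, whose columns are orthogonal.
H4 = [[1, 1, 1, 1],
      [1, -1, 1, -1],
      [1, 1, -1, -1],
      [1, -1, -1, 1]]
B = random_symmetric_sign_matrix(64, random.Random(1))
print(satisfies_46(H4), max_offdiag_inner_product(B))
```

For moderate $k$ a random draw may well violate (46), in line with the lemma's requirement that $k$ exceed some constant.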

The graphon “hypotheses” that we consider in this proof are generated by the matrix of connection 
probabilities Q := (J + B )/2 where B is a matrix from Lemma 4.2 and J is a k x k matrix with all entries 
equal to 1. 

Fix some $\epsilon \le 1/(4k)$. Denote by $\mathcal{C}_0$ the collection of vectors $u \in \{-\epsilon, \epsilon\}^k$ satisfying $\sum_{a=1}^{k} u_a = 0$. For any $u \in \mathcal{C}_0$, define the cumulative distribution $F_u$ on $\{0,\dots,k\}$ by the relations $F_u(0) = 0$ and $F_u(a) = a/k + \sum_{b=1}^{a} u_b$ for $a \in [k]$. Then, set $\Pi_{ab}(u) = [F_u(a-1), F_u(a)) \times [F_u(b-1), F_u(b))$ and define the graphon $W_u \in \mathcal{W}[k]$ by

$$
W_u(x,y) = \sum_{(a,b) \in [k]\times[k]} Q_{ab}\, 1_{\Pi_{ab}(u)}(x,y).
$$

The graphon $W_u$ is slightly unbalanced as the weight of each class is either $1/k - \epsilon$ or $1/k + \epsilon$.

Let $\mathbb{P}_{W_u}$ denote the distribution of observations $A' := (A_{ij},\ 1 \le j < i \le n)$ sampled according to the sparse graphon model (2) with $W_0 = W_u$. Since the matrix $Q$ is fixed, the difficulty in distinguishing between the distributions $\mathbb{P}_{W_u}$ and $\mathbb{P}_{W_v}$ for $u \ne v$ comes from the randomness of the design points $\xi_1,\dots,\xi_n$ in the graphon model (2) rather than from the randomness of the realization of $A'$ conditionally on $\xi_1,\dots,\xi_n$. The following lemma gives a bound on the Kullback-Leibler divergences $\mathcal{K}(\mathbb{P}_{W_u}, \mathbb{P}_{W_v})$ between $\mathbb{P}_{W_u}$ and $\mathbb{P}_{W_v}$.

Lemma 4.3. For all $u, v \in \mathcal{C}_0$ we have

$$
\mathcal{K}(\mathbb{P}_{W_u}, \mathbb{P}_{W_v}) \le 16 n k^2 \epsilon^2/3.
$$

Next, we need the following combinatorial result in the spirit of the Varshamov-Gilbert lemma [15, Chapter 2]. It shows that there exists a large subset of $\mathcal{C}_0$ composed of vectors $u$ that are well separated in the Hamming distance. We state this result in terms of the sets $A_u := \{a \in [k] : u_a = \epsilon\}$ where $u \in \mathcal{C}_0$. Notice that, by definition of $\mathcal{C}_0$, we have $|A_u| = k/2$ for all $u \in \mathcal{C}_0$.

Lemma 4.4. There exists a subset $\mathcal{C}$ of $\mathcal{C}_0$ such that $\log|\mathcal{C}| \ge k/16$ and

$$
|A_u \,\Delta\, A_v| > k/4 \tag{48}
$$

for any $u \ne v \in \mathcal{C}$.
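
For small $k$, a set with the separation property (48) can be built greedily. The sketch below scans all subsets of $[k]$ of size $k/2$ (each playing the role of a set $A_u$) and keeps a subset whenever its symmetric difference with every previously kept one exceeds $k/4$. The result is a maximal separated family, which is the kind of object used in the proof of the lemma. The function name is hypothetical and the exhaustive scan is only feasible for small $k$.

```python
import math
from itertools import combinations


def greedy_packing(k):
    """Greedy maximal family of (k/2)-subsets of range(k) with pairwise |A ^ A'| > k/4."""
    separation = k / 4
    chosen = []
    for subset in combinations(range(k), k // 2):
        candidate = set(subset)
        # keep the candidate only if it is far from every set kept so far
        if all(len(candidate ^ other) > separation for other in chosen):
            chosen.append(candidate)
    return chosen


packing = greedy_packing(8)
print(len(packing), math.exp(8 / 16))
```

The lemma guarantees that any such maximal family has at least $\exp(k/16)$ elements; for small $k$ the greedy family is typically much larger than this lower bound.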

Lemmas 4.4 and 4.2 are used to obtain the following lower bound on the distance $\delta(W_u, W_v)$ between two distinct graphons in $\mathcal{C}$. This lemma is the main ingredient of the proof.

Lemma 4.5. For all $u, v \in \mathcal{C}$ such that $u \ne v$, the graphons $W_u$ and $W_v$ are well separated in the $\delta(\cdot,\cdot)$ distance:

$$
\delta^2(W_u, W_v) \ge \eta_0^2 k\epsilon/2,
$$

so that

$$
\delta^2(\rho_n W_u, \rho_n W_v) \ge \rho_n^2 \eta_0^2 k\epsilon/2, \quad \forall\, u, v \in \mathcal{C} : u \ne v. \tag{49}
$$

Now, choose $\epsilon$ such that $\epsilon^2 = \frac{3}{(16)^3 n k}$. Then it follows from Lemmas 4.3 and 4.4 that

$$
\mathcal{K}(\mathbb{P}_{W_u}, \mathbb{P}_{W_v}) \le \frac{1}{16}\log|\mathcal{C}|, \quad \forall\, u, v \in \mathcal{C} : u \ne v. \tag{50}
$$

In view of Theorem 2.7 in [15], inequalities (49) and (50) imply that

$$
\inf_{\hat f}\ \sup_{W_0 \in \mathcal{W}[k]} \mathbb{E}_{W_0}\big[\delta^2(\hat f, \rho_n W_0)\big] \ge C\rho_n^2\sqrt{\frac{k}{n}}
$$

where C > 0 is an absolute constant. This completes the proof for the case when k is a large enough multiple 
of 16. 

Let now k = 2. Then, we reduce the lower bound to the problem of testing two hypotheses. We consider 
the matrix B = ^ ^ ^ and the two corresponding graphons W Ul and W U2 with u\ = (e, —e) and 

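$u_2 = (-\epsilon, \epsilon)$. Adapting the argument of Lemma 4.5, one can prove that $\delta^2(\rho_n W_{u_1}, \rho_n W_{u_2}) \ge \rho_n^2 \epsilon$. Moreover, exactly as in Lemma 4.3, the Kullback-Leibler divergence between $\mathbb{P}_{W_{u_1}}$ and $\mathbb{P}_{W_{u_2}}$ is bounded by $Cn\epsilon^2$. Taking $\epsilon$ of the order $n^{-1/2}$ and using Theorem 2.2 from [15] we conclude that

$$
\inf_{\hat f}\ \sup_{W_0 \in \mathcal{W}[2]} \mathbb{E}_{W_0}\big[\delta^2(\hat f, \rho_n W_0)\big] \ge C\rho_n^2\sqrt{\frac{1}{n}}
$$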

where C > 0 is an absolute constant. 

Proof of Lemma 4.2. Let $B$ be a $k \times k$ symmetric random matrix such that its elements $B_{a,b}$, $a \le b \in [k]$, are independent Rademacher random variables. It suffices to prove that $B$ satisfies properties (46) and (47) with positive probability. Fix $a \ne b$. Then, $\langle B_{a,\cdot}, B_{b,\cdot}\rangle$ is distributed as a sum of $k$ independent Rademacher variables. By Hoeffding's inequality

$$
\mathbb{P}\big[ |\langle B_{a,\cdot}, B_{b,\cdot}\rangle| \ge k/4 \big] \le 2\exp[-k/32].
$$

By the union bound, property (46) is satisfied for all $a \ne b$ with probability greater than $1 - 2k^2\exp[-k/32]$. For $k$ larger than some absolute constant, this probability is greater than 3/4.

Fix any two subsets $X$ and $Y$ of $[k]$ such that $|X| = |Y| = \eta_0 k$ and $X \cap Y = \emptyset$. Let $\pi_1$ and $\pi_2$ be any two labelings of $X$ and $Y$. Then, define $T_{\pi_1,\pi_2} := \sum_{a,b=1}^{\eta_0 k} [B_{\pi_1(a),\pi_1(b)} - B_{\pi_2(a),\pi_2(b)}]^2$. By symmetry of $B$, $T_{\pi_1,\pi_2}/8$ is greater than $\sum_{a<b} [B_{\pi_1(a),\pi_1(b)} - B_{\pi_2(a),\pi_2(b)}]^2/4$ (we have put aside the diagonal terms) where all the summands $[B_{\pi_1(a),\pi_1(b)} - B_{\pi_2(a),\pi_2(b)}]^2/4$ are independent Bernoulli random variables with parameter 1/2 since $\pi_1([\eta_0 k]) \cap \pi_2([\eta_0 k]) = \emptyset$. Thus, $T_{\pi_1,\pi_2}/8$ is stochastically greater than a binomial random variable with parameters $\eta_0 k(\eta_0 k - 1)/2$ and 1/2. Applying again Hoeffding's inequality, we find

$$
\mathbb{P}\left[ \frac{T_{\pi_1,\pi_2}}{8} \le \frac{\eta_0^2 k^2}{8} - \frac{\eta_0 k}{4} \right] \le \exp\left[ -\frac{\eta_0^2 k^2}{32} \right].
$$

For $k$ large enough we have $\eta_0 k/4 \le \eta_0^2 k^2/16$ so that $T_{\pi_1,\pi_2} \ge \eta_0^2 k^2/2$ with probability greater than $1 - \exp(-\eta_0^2 k^2/32)$. There are less than $k^{2\eta_0 k}$ such maps $(\pi_1, \pi_2)$ so that property (47) is satisfied with probability greater than $1 - \exp(2\eta_0 k\log(k) - \eta_0^2 k^2/32)$. Again, this probability is greater than 3/4 for $k$ large enough. Applying once again the union bound we find that properties (46) and (47) are satisfied with probability greater than 1/2.

□ 

Proof of Lemma 4.4. Let $\mathcal{C}$ be a maximal subset of $\mathcal{C}_0$ of points $u$ such that the corresponding $A_u$ are $k/4$-separated with respect to the measure of symmetric difference distance. By maximality of $\mathcal{C}$, the union of all balls in the Hamming distance centered at $u \in \mathcal{C}$ with radii $k/4$ covers $\mathcal{C}_0$. Denoting by $B[u, k/4]$ such a ball, we obtain by a volumetric argument

$$
|B[u, k/4]|/|\mathcal{C}_0| \ge |\mathcal{C}|^{-1}. \tag{51}
$$

If we endow $\mathcal{C}_0$ with the uniform probability, then $|B[u, k/4]|\,|\mathcal{C}_0|^{-1}$ is the probability to draw a point from $B[u, k/4]$. Note that a point $v$ is in $B[u, k/4]$ if $|A_v \setminus A_u| \le k/8$. One can construct the set $A_v$ by sampling $k/2$ elements without replacement from the set $[k]$. We call an $a \in [k]$ a success if $a \in A_u$. There are $k/2$ of them. Let $S_k$ denote the number of successes; then $S_k$ follows a hypergeometric distribution with parameters $(k, k/2, k/2)$ and we have

$$
\frac{|B[u, k/4]|}{|\mathcal{C}_0|} = \mathbb{P}[S_k \ge 3k/8].
$$

Now we only have to bound the deviations of $S_k$. It follows from [1, p.173] that $S_k$ has the same distribution as the random variable $\mathbb{E}[\eta\,|\,\mathcal{B}]$ where $\eta$ is a binomial random variable with parameters $(k/2, 1/2)$ and $\mathcal{B}$ is some suitable $\sigma$-algebra. Thus, by a convexity argument, we obtain that, for any $\lambda > 0$,

$$
\mathbb{E}\, e^{\lambda(S_k - k/4)} \le \mathbb{E}\, e^{\lambda(\eta - k/4)} \le e^{\lambda^2 k/16}
$$

where the last bound is due to Hoeffding's inequality. Applying the exponential Markov inequality and choosing $\lambda = 1$, we obtain

$$
\mathbb{P}\Big[ S_k - \frac{k}{4} \ge \frac{k}{8} \Big] \le \exp[-k/16]
$$

and, in view of (51), we conclude that $|\mathcal{C}| \ge \exp[k/16]$.

□
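
The hypergeometric tail bound used in this proof can be evaluated exactly for moderate $k$. The sketch below computes $\mathbb{P}[S_k \ge 3k/8]$ for $S_k$ hypergeometric with parameters $(k, k/2, k/2)$ and compares it with $\exp(-k/16)$. The function name is illustrative and $k$ is taken to be a multiple of 16, as in the lemma.

```python
import math


def hypergeom_upper_tail(k):
    """P[S_k >= 3k/8] when k/2 items are drawn without replacement from k items,
    k/2 of which are successes."""
    half = k // 2
    threshold = -(-3 * k // 8)             # ceiling of 3k/8 in integer arithmetic
    total = math.comb(k, half)
    favorable = sum(math.comb(half, s) * math.comb(half, half - s)
                    for s in range(threshold, half + 1))
    return favorable / total


for k in (16, 32, 64, 128):
    print(k, hypergeom_upper_tail(k), math.exp(-k / 16))
```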


Proof of Lemma 4.5. Let $u$ and $v$ be two different vectors in $\mathcal{C}$ and let $\tau$ be a measure preserving bijection $[0,1] \to [0,1]$. We aim to prove that, for any $\tau$,

$$
\int\!\!\int |W_u(x,y) - W_v(\tau(x), \tau(y))|^2\,dx\,dy \ge \eta_0^2 k\epsilon/2, \tag{52}
$$

where we recall that $\eta_0 = 1/16$.

If $x$ and $x'$ correspond to two different classes of $W_u$, that is $x \in [F_u(a-1), F_u(a))$ and $x' \in [F_u(b-1), F_u(b))$ for some $a \ne b$, then the inner product between $W_u(x,\cdot)$ and $W_u(x',\cdot)$ satisfies

$$
\int (W_u(x,y) - 1/2)(W_u(x',y) - 1/2)\,dy = \frac{1}{4}\sum_{c=1}^{k} \Big(\frac{1}{k} + u_c\Big) B_{ac}B_{bc} \le \frac{1}{4k}\langle B_{a,\cdot}, B_{b,\cdot}\rangle + \frac{k\epsilon}{4} \le \frac{1}{8}, \tag{53}
$$

since we assume that $4k\epsilon \le 1$.

For any $a, b \in [k]$, define $\omega_{ab}$ the Lebesgue measure of $[F_u(a-1), F_u(a)) \cap \tau^{-1}([F_v(b-1), F_v(b)))$. Since $\tau$ is measure preserving, $\sum_b \omega_{ab} = 1/k + u_a$ and $\sum_a \omega_{ab} = 1/k + v_b$. For any $a$ and $b \in [k]$ define $h_{u,a}(y) = W_u(F_u(a-1), y) - 1/2$ and $k_{v,b}(y) = W_v(F_v(b-1), \tau(y)) - 1/2$. Then, we have

$$
\int\!\!\int |W_u(x,y) - W_v(\tau(x), \tau(y))|^2\,dx\,dy = \sum_{a=1}^{k}\sum_{b=1}^{k} \omega_{ab} \int |h_{u,a}(y) - k_{v,b}(y)|^2\,dy.
$$

Let $\langle\cdot,\cdot\rangle_2$ and $\|\cdot\|_2$ denote the standard inner product and the Euclidean norm in $L_2([0,1])$. By definition of $W_v$ we have that $|k_{v,a}(y)| = 1/2$ for all $y \in [0,1]$ and any $a \in [k]$, which implies $\|k_{v,a}\|_2 = 1/2$. Now for $b_1 \ne b_2$, $\|k_{v,b_1} - k_{v,b_2}\|_2^2 \ge 1/2 - 1/4 \ge 1/8$ by (53). By the triangle inequality, this implies that

$$
\|h_{u,a} - k_{v,b_1}\|_2^2 + \|h_{u,a} - k_{v,b_2}\|_2^2 \ge \frac{\|k_{v,b_1} - k_{v,b_2}\|_2^2}{2} \ge 1/16.
$$
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As a consequence, for any a £ [fc] there exists at most one b £ [k] such that \\h u ^ a — fc „,&|| 2 < 1/32. If such 
an index b exists, it is denoted by it (a). Exactly the same argument shows that for any b £ [k] there exists 
at most one a £ [A;] such that || h u ^ a — A^bH 2 < 1/32 which implies that there exists no a / a' such that 
7 r(a) = 7 r(a'). Thus, it is possible to extend 7 r to a permutation of [A;]. We get 



\[ \int |W_u(x,y) - W_v(\tau(x),\tau(y))|^2 \, dx\,dy \ge \frac{1}{32}\sum_{a=1}^{k}\sum_{b \ne \pi(a)} \omega_{ab} = \frac{1}{32}\sum_{a=1}^{k}\Big(\frac{1}{k} + u_a - \omega_{a,\pi(a)}\Big) . \]


If the sum $\sum_{a=1}^{k} (1/k + u_a - \omega_{a,\pi(a)})$ is greater than $k\epsilon/16$, then (52) is satisfied. Thus, we can assume 
in the sequel that $\sum_{a=1}^{k} (1/k + u_a - \omega_{a,\pi(a)}) \le k\epsilon/16$. Using that $\omega_{ab} \le (1/k + u_a) \wedge (1/k + v_b)$ and that the 
cardinality of the collection $\{a \in [k] : u_a > 0\}$ is $k/2$, we deduce that the collection $\{a \in [k] : u_a > 0,\ v_{\pi(a)} > 0 \text{ and } \omega_{a,\pi(a)} \ge 1/k\}$ has cardinality greater than $7k/16$. Since for $u \ne v \in C$, $|A_u \cap A_v| \le 3k/8$ (Lemma 
4.4), there exist subsets $A \subset A_u$ and $B \subset A_v$ of cardinality $\eta_0 k$ (recall that $\eta_0 = 1/16$) such that $\pi(A) = B$, 
$A \cap B = \emptyset$, and $\omega_{a,\pi(a)} \ge 1/k$ for all $a \in A$. Hence, 


\[
\begin{aligned}
\int |W_u(x,y) - W_v(\tau(x),\tau(y))|^2 \, dx\,dy
&\ge \sum_{a_1 \in A}\sum_{a_2 \in A} \int_{[F_u(a_1-1),F_u(a_1)) \times [F_u(a_2-1),F_u(a_2))} |W_u(x,y) - W_v(\tau(x),\tau(y))|^2 \, dx\,dy \\
&\ge \sum_{a_1 \in A}\sum_{a_2 \in A} \omega_{a_1,\pi(a_1)}\,\omega_{a_2,\pi(a_2)} \big[Q_{a_1,a_2} - Q_{\pi(a_1),\pi(a_2)}\big]^2 \\
&\ge \frac{1}{4k^2}\sum_{a_1 \in A}\sum_{a_2 \in A} \big[B_{a_1,a_2} - B_{\pi(a_1),\pi(a_2)}\big]^2 ,
\end{aligned}
\]

where the last inequality follows from the facts that $Q = (J + B)/2$ and $\omega_{a,\pi(a)} \ge 1/k$. Finally, we apply 
the property (47) of $B$ to conclude that 

\[ \int |W_u(x,y) - W_v(\tau(x),\tau(y))|^2 \, dx\,dy \ge \eta_0^2/8 \ge \eta_0^2 k\epsilon/2 . \]

This proves (52) and thus the lemma. 


Proof of Lemma 4.3. For $u \in C_0$, let $\xi(u) = (\xi_1(u), \ldots, \xi_n(u))$ be the vector of $n$ i.i.d. random variables 
with the discrete distribution on $[k]$ defined by $\mathbb{P}[\xi_1(u) = a] = 1/k + u_a$ for any $a \in [k]$. Let $\Theta_0$ be the 
$n \times n$ symmetric matrix with elements $(\Theta_0)_{ii} = 0$ and $(\Theta_0)_{ij} = \rho_n Q_{\xi_i(u),\xi_j(u)}$ for $i \ne j$. Assume that, 
conditionally on $\xi(u)$, the adjacency matrix $A$ is sampled according to the network sequence model with 
such probability matrix $\Theta_0$. Notice that in this case the observations $A' = (A_{ij}, 1 \le j < i \le n)$ have the 
probability distribution $P_{W_u}$. Using this remark and introducing the probabilities $\alpha_a(u) = \mathbb{P}[\xi(u) = a]$ and 
$p_{Aa} = \mathbb{P}[A' = A \mid \xi(u) = a]$ for $a \in [k]^n$, we can write the Kullback-Leibler divergence between $P_{W_u}$ and $P_{W_v}$ 
in the form 

\[ \mathcal{K}(P_{W_u}, P_{W_v}) = \sum_{A}\sum_{a} p_{Aa}\,\alpha_a(u) \log\left( \frac{\sum_{a} p_{Aa}\,\alpha_a(u)}{\sum_{a} p_{Aa}\,\alpha_a(v)} \right) \]

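where the sums in $a$ are over $[k]^n$ and the sum in $A$ is over all triangular upper halves of matrices in $\{0,1\}^{n \times n}$. 
Since the function $(x, y) \mapsto x\log(x/y)$ is convex we can apply Jensen's inequality to get 

\[ \mathcal{K}(P_{W_u}, P_{W_v}) \le \sum_{a} \alpha_a(u) \log\left( \frac{\alpha_a(u)}{\alpha_a(v)} \right) = n \sum_{a \in [k]} \Big(\frac{1}{k} + u_a\Big) \log\left( \frac{1/k + u_a}{1/k + v_a} \right) , \tag{54} \]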


where the last equality follows from the fact that $\alpha_a(u)$ are $n$-product probabilities. Since the Kullback-Leibler 
divergence does not exceed the chi-square divergence we obtain 

\[ \sum_{a \in [k]} \Big(\frac{1}{k} + u_a\Big) \log\left( \frac{1/k + u_a}{1/k + v_a} \right) \le \sum_{a \in [k]} \frac{(u_a - v_a)^2}{1/k + v_a} \le 16k^2\epsilon^2/3 , \]

where the last inequality uses that $|v_a| \le \epsilon \le 1/(4k)$ and $|u_a - v_a| \le 2\epsilon$. Combining this with (54) proves the 
lemma. □
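The chi-square step above can be checked numerically. The following Python sketch is only an illustration: it assumes hypothesis vectors with $k/2$ entries equal to $+\epsilon$ and $k/2$ entries equal to $-\epsilon$ (so that the class probabilities $1/k + u_a$ sum to one), which may differ from the exact construction of the set used in the proof. The helper names are hypothetical.

```python
import math
import random

def class_kl_and_chi2(u, v, k):
    """KL and chi-square divergences between the class laws (1/k + u_a) and (1/k + v_a)."""
    kl = sum((1 / k + ua) * math.log((1 / k + ua) / (1 / k + va)) for ua, va in zip(u, v))
    chi2 = sum((ua - va) ** 2 / (1 / k + va) for ua, va in zip(u, v))
    return kl, chi2

def random_balanced_signs(k, eps, rng):
    """Assumed hypothesis vector: k/2 entries +eps and k/2 entries -eps, in random order."""
    entries = [eps] * (k // 2) + [-eps] * (k // 2)
    rng.shuffle(entries)
    return entries

rng = random.Random(0)
k = 8
eps = 1 / (4 * k)  # largest value compatible with eps <= 1/(4k)
u = random_balanced_signs(k, eps, rng)
v = random_balanced_signs(k, eps, rng)
kl, chi2 = class_kl_and_chi2(u, v, k)
bound = 16 * k**2 * eps**2 / 3
print(kl, chi2, bound)
```

Under these assumptions the three printed numbers should appear in increasing order, in line with the chain of inequalities in the proof.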


4.9.2 Proof of (44) 

As in the proof of Proposition 2.4, we use here Fano's method. The main difference is that the graphon 
separation in $\delta(\cdot,\cdot)$ distance is more difficult to handle than the matrix separation in the Frobenius distance. 

First, note that it is sufficient to prove (44) for $k > k_0$ where $k_0$ is any fixed integer. Indeed, if $k \le k_0$, 
the lower bound (44) immediately follows from (45). 

Let $C_0$ denote the set of all symmetric $k \times k$ matrices with entries in $\{-1,1\}$. The graphon hypotheses 
that we consider in this proof are generated by matrices of intra-class connection probabilities of the form 
$Q_B := (J + \epsilon B)/2$ where $B \in C_0$, $\epsilon \in (0,1/2)$, and $J$ is a $k \times k$ matrix with all entries equal to 1. Given a 
matrix $Q_B$, define the graphon $W_B \in \mathcal{W}[k]$ by the formula 

\[ W_B(x,y) = \sum_{(a,b) \in [k] \times [k]} (Q_B)_{ab}\, \mathbf{1}_{[(a-1)/k,\,a/k) \times [(b-1)/k,\,b/k)}(x,y) . \]

As in the previous proof, we use the following combinatorial result in the spirit of the Varshamov-Gilbert 
lemma [15, Chapter 2] that grants the existence of a large subset $C$ of $C_0$ such that the matrices $B \in C$ are 
well separated in some sense. Given any two permutations $\pi$ and $\pi'$ of $[k]$ and any matrix $B$, we denote by 
$B^{\pi,\pi'}$ a matrix with entries $B^{\pi,\pi'}_{ab} = B_{\pi(a)\pi'(b)}$. 


Lemma 4.6. Let $k_0$ be an integer large enough. For any $k > k_0$, there exists a subset $C$ of $C_0$ satisfying 
$\log |C| \ge k^2/32$ and such that 

\[ \|B_1 - B_2^{\pi,\pi'}\|_F^2 \ge k^2/2 \tag{55} \]

for all permutations $\pi$, $\pi'$ and all $B_1 \ne B_2 \in C$. 


We assume in the rest of this proof that $k$ is greater than $k_0$. As noticed above, it is enough to prove 
(44) only in this case. We choose a maximal subset $C$ satisfying the properties stated in Lemma 4.6. The 
next lemma shows that the separation between matrices $B$ in $C$ translates into separation between the 
corresponding graphons $W_B$. 


Lemma 4.7. Let $C$ be a maximal set satisfying the properties stated in Lemma 4.6. For any two distinct 
$B_1$ and $B_2$ in $C$, we have 

\[ \delta^2(\rho_n W_{B_1}, \rho_n W_{B_2}) \ge \rho_n^2\epsilon^2/8 . \tag{56} \]

Finally, in order to apply Fano's method, we need to have an upper bound on the Kullback-Leibler 
divergence between the distributions $P_{W_B}$ for $B \in C$. It is given in the next lemma. 

Lemma 4.8. For any $B_1$ and $B_2$ in $C$, we have 

\[ \mathcal{K}(P_{W_{B_1}}, P_{W_{B_2}}) \le 3n^2\rho_n\epsilon^2 . \]

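Choosing now $\epsilon = \big(\frac{k}{n\sqrt{\rho_n}} \wedge 1\big)/32$ we deduce from Lemmas 4.8 and 4.6 that 

\[ \mathcal{K}(P_{W_{B_1}}, P_{W_{B_2}}) \le \frac{3}{32}\log |C| , \qquad \forall B_1, B_2 \in C : B_1 \ne B_2 . \tag{57} \]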

In view of Theorem 2.7 in [15], inequalities (56) and (57) imply that 

\[ \inf_{\hat f} \sup_{W_0 \in \mathcal{W}[k]} E_{W_0}\big[\delta^2(\hat f, \rho_n W_0)\big] \ge C\Big(\rho_n \frac{k^2}{n^2} \wedge \rho_n^2\Big) \]

where $C > 0$ is an absolute constant. This completes the proof of (44). 

Proof of Lemma 4.6. Define the pseudo-distance $d_\Pi$ on $C_0$ by $d_\Pi(B_1, B_2) := \min_{\pi,\pi'} \|B_1 - B_2^{\pi,\pi'}\|_F$ where 
the minimum is taken over all permutations of $[k]$. Let $C$ be a maximal subset of $C_0$ of matrices $B$ that are 
$\sqrt{k(k+1)/2}$-separated with respect to $d_\Pi$. By maximality of $C$, the union of all balls in $d_\Pi$ distance centered 
at $B \in C$ with radii $\sqrt{k(k+1)/2}$ covers $C_0$. Denoting by $\mathcal{B}[B, \sqrt{k(k+1)/2}, d_\Pi]$ such a ball, we obtain by a 
volumetric argument 


\[ \big|\mathcal{B}[B, \sqrt{k(k+1)/2}, d_\Pi]\big| \, / \, |C_0| \ge |C|^{-1} . \]

Since $d_\Pi(B_1, B_2)$ is the infimum over all $k!^2$ permutations of the distance $\|B_1 - B_2^{\pi,\pi'}\|_F$, we have 

\[ \big|\mathcal{B}[B, \sqrt{k(k+1)/2}, d_\Pi]\big| \le k!^2 \, \big|\mathcal{B}[B, \sqrt{k(k+1)/2}, \|\cdot\|_F]\big| \]

where $\mathcal{B}[B, \sqrt{k(k+1)/2}, \|\cdot\|_F]$ is the ball centered at $B$ with respect to the Frobenius distance. Given $B_1$ 
and $B_2$ in $C_0$, $\|B_1 - B_2\|_F^2/4$ is equal to the Hamming distance between $B_1$ and $B_2$. Consider the ball 
$\mathcal{B}[B, k(k+1)/8, d_H]$ centered at $B$ with respect to the Hamming distance. As matrices in $C_0$ are symmetric, 
$C_0$ is in bijection with $\{0,1\}^{k(k+1)/2}$. Using Hamming bound and Varshamov-Gilbert lemma [15, Chapter 2] 
we obtain 

\[ 2^{-k(k+1)/2}\, \big|\mathcal{B}[B, k(k+1)/8, d_H]\big| \le |C'|^{-1} \le e^{-k(k+1)/16} \]

where C 1 is a maximal subset of matrices B that are fc ( fc + 1 ) -separated with respect to the Hamming distance. 
We then conclude that 


\[ |C| \ge \frac{|C_0|}{k!^2 \, \big|\mathcal{B}[B, \sqrt{k(k+1)/2}, \|\cdot\|_F]\big|} \ge \exp\big[k\{(k+1)/16 - 2\log(k)\}\big] \]

which is larger than $\exp(k^2/32)$ for $k$ large enough. 

□ 
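To get a feeling for what "$k$ large enough" means in the last step, the following Python sketch searches for the smallest integer from which $\exp[k\{(k+1)/16 - 2\log k\}] \ge \exp(k^2/32)$ holds. It only concerns this final comparison; the value of $k_0$ in Lemma 4.6 may be constrained by other steps of the proof, and the function names and the search range `k_max` are assumptions made for illustration.

```python
import math

def exponent_gap(k):
    """k{(k+1)/16 - 2 log k} - k^2/32; a nonnegative value means the final comparison holds."""
    return k * ((k + 1) / 16 - 2 * math.log(k)) - k**2 / 32

def smallest_admissible_k0(k_max=10_000):
    """Smallest k0 such that exponent_gap(k) >= 0 for every k with k0 <= k <= k_max."""
    k0 = k_max + 1
    for k in range(k_max, 0, -1):
        if exponent_gap(k) < 0:
            break
        k0 = k
    return k0

k0 = smallest_admissible_k0()
print(k0)
```

Since $k/32 + 1/16 - 2\log k$ is increasing for $k > 64$, the comparison keeps holding beyond the search range once it holds at the returned value.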


Proof of Lemma 4.7. Let $B_1$ and $B_2$ be two distinct matrices in $C$ and let $\tau$ be a measure preserving bijection 
$[0,1] \to [0,1]$. Our aim is to prove that, for any such $\tau$, 

\[ \int |W_{B_1}(x,y) - W_{B_2}(\tau(x),\tau(y))|^2 \, dx\,dy \ge \epsilon^2/8 . \tag{58} \]

For any $a, b \in [k]$, let $\omega_{ab}$ denote the Lebesgue measure of $[(a-1)/k, a/k] \cap \tau([(b-1)/k, b/k])$. Since $\tau$ is 
measure preserving, $\sum_b \omega_{ab} = 1/k$ and $\sum_a \omega_{ab} = 1/k$. Hence, the $k \times k$ matrix $k\omega$, where $\omega = (\omega_{ab})_{a,b \in [k]}$, 
is doubly stochastic. For any permutation $\pi$ of $[k]$, denote by $H(\pi)$ the corresponding permutation matrix. By 
the Birkhoff-von Neumann theorem [8], $k\omega$ is a convex combination of permutation matrices, that is there 
exist positive numbers $\gamma_\pi$ such that $\sum_\pi \gamma_\pi = 1/k$, and $\omega = \sum_\pi \gamma_\pi H(\pi)$ where the summation runs over all 
permutations. Using these remarks we obtain 


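\[
\begin{aligned}
\int |W_{B_1}(x,y) - W_{B_2}(\tau(x),\tau(y))|^2 \, dx\,dy
&= \frac{\epsilon^2}{4} \sum_{a_1,a_2,b_1,b_2 \in [k]} \omega_{a_1a_2}\,\omega_{b_1b_2} \big[(B_1)_{a_1b_1} - (B_2)_{a_2b_2}\big]^2 \\
&= \frac{\epsilon^2}{4} \sum_{\pi_1,\pi_2} \gamma_{\pi_1}\gamma_{\pi_2} \sum_{a_1,a_2,b_1,b_2 \in [k]} H(\pi_1)_{a_1a_2} H(\pi_2)_{b_1b_2} \big[(B_1)_{a_1b_1} - (B_2)_{a_2b_2}\big]^2 \\
&= \frac{\epsilon^2}{4} \sum_{\pi_1,\pi_2} \gamma_{\pi_1}\gamma_{\pi_2} \sum_{a,b} \big[(B_1)_{ab} - (B_2)_{\pi_1(a)\pi_2(b)}\big]^2 \quad \text{(by the definition of permutation matrices)} \\
&= \frac{\epsilon^2}{4} \sum_{\pi_1,\pi_2} \gamma_{\pi_1}\gamma_{\pi_2} \big\|B_1 - B_2^{\pi_1,\pi_2}\big\|_F^2 \\
&\ge \frac{\epsilon^2 k^2}{8} \sum_{\pi_1,\pi_2} \gamma_{\pi_1}\gamma_{\pi_2} = \frac{\epsilon^2}{8} ,
\end{aligned}
\]

where we have used Lemma 4.6 and the property $\sum_\pi \gamma_\pi = 1/k$. 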


□ 
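The step labelled "by the definition of permutation matrices" collapses a sum over four indices into a permuted Frobenius distance. A minimal Python sketch of this identity on a small random example is given below; the convention $H(\pi)_{ab} = 1$ if and only if $b = \pi(a)$ and all helper names are assumptions made for illustration.

```python
import itertools
import random

def perm_matrix(pi):
    """Permutation matrix H(pi) with H[a][b] = 1 exactly when b == pi[a] (assumed convention)."""
    k = len(pi)
    return [[1 if pi[a] == b else 0 for b in range(k)] for a in range(k)]

def quadruple_sum(H1, H2, B1, B2):
    """Sum over (a1, a2, b1, b2) of H1[a1][a2] * H2[b1][b2] * (B1[a1][b1] - B2[a2][b2])^2."""
    k = len(B1)
    return sum(
        H1[a1][a2] * H2[b1][b2] * (B1[a1][b1] - B2[a2][b2]) ** 2
        for a1, a2, b1, b2 in itertools.product(range(k), repeat=4)
    )

def permuted_frobenius_sq(pi1, pi2, B1, B2):
    """Squared Frobenius norm of B1 - B2^{pi1, pi2}, with (B2^{pi1, pi2})_{ab} = B2[pi1[a]][pi2[b]]."""
    k = len(B1)
    return sum((B1[a][b] - B2[pi1[a]][pi2[b]]) ** 2 for a in range(k) for b in range(k))

def random_symmetric_signs(k, rng):
    """Random symmetric k x k matrix with entries in {-1, 1}."""
    B = [[0] * k for _ in range(k)]
    for a in range(k):
        for b in range(a, k):
            B[a][b] = B[b][a] = rng.choice([-1, 1])
    return B

rng = random.Random(1)
k = 4
B1 = random_symmetric_signs(k, rng)
B2 = random_symmetric_signs(k, rng)
pi1 = list(range(k))
pi2 = list(range(k))
rng.shuffle(pi1)
rng.shuffle(pi2)
lhs = quadruple_sum(perm_matrix(pi1), perm_matrix(pi2), B1, B2)
rhs = permuted_frobenius_sq(pi1, pi2, B1, B2)
print(lhs, rhs)
```

Both printed values should coincide, since each permutation matrix selects a single column per row.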


Proof of Lemma 4.8. The proof is quite similar to that of Lemma 4.3. Fix two matrices $B_1$ and $B_2$ in $C$. 
Let $\xi = (\xi_1, \ldots, \xi_n)$ be the vector of $n$ i.i.d. random variables with uniform distribution on $[k]$. Let $\Theta_1$ be 
the $n \times n$ symmetric matrix with elements $(\Theta_1)_{ii} = 0$ and $(\Theta_1)_{ij} = \rho_n[1 + \epsilon(B_1)_{\xi_i\xi_j}]/2$ for $i \ne j$. Assume 
that, conditionally on $\xi$, the adjacency matrix $A$ is sampled according to the network sequence model 
with such probability matrix $\Theta_1$. Notice that in this case the observations $A' = (A_{ij}, 1 \le j < i \le n)$ have 
the probability distribution $P_{W_{B_1}}$. Using this remark, we introduce the probabilities $\alpha_a = \mathbb{P}[\xi = a]$ and 
$p^{(1)}_{Aa} = \mathbb{P}[A' = A \mid \xi = a]$ for $a \in [k]^n$. Next, we introduce the analogous probabilities $p^{(2)}_{Aa}$ for a matrix $\Theta_2$ 
depending on $B_2$ in the same way as $\Theta_1$ depends on $B_1$. The Kullback-Leibler divergence between $P_{W_{B_1}}$ 
and $P_{W_{B_2}}$ has the form 

\[ \mathcal{K}(P_{W_{B_1}}, P_{W_{B_2}}) = \sum_{A}\sum_{a} \alpha_a\, p^{(1)}_{Aa} \log\left( \frac{\sum_{a} \alpha_a\, p^{(1)}_{Aa}}{\sum_{a} \alpha_a\, p^{(2)}_{Aa}} \right) \]

where the sums in $a$ are over $[k]^n$ and the sum in $A$ is over all triangular upper halves of matrices in $\{0,1\}^{n \times n}$. 
Since the function $(x, y) \mapsto x\log(x/y)$ is convex we can apply Jensen's inequality to get 

\[ \mathcal{K}(P_{W_{B_1}}, P_{W_{B_2}}) \le \sum_{a} \alpha_a \sum_{A} p^{(1)}_{Aa} \log\left( \frac{p^{(1)}_{Aa}}{p^{(2)}_{Aa}} \right) . \]

Here, the sum in $A$ for fixed $a$ is the Kullback-Leibler divergence between two $n(n-1)/2$-products of Bernoulli 
measures, each of which has success probability either $\rho_n(1+\epsilon)/2$ or $\rho_n(1-\epsilon)/2$. Thus, for $p = (1+\epsilon)/2$ 
and $q = (1-\epsilon)/2$ we have 

\[ \mathcal{K}(P_{W_{B_1}}, P_{W_{B_2}}) \le \frac{n(n-1)}{2}\, \rho_n\, \kappa(p, q) , \tag{59} \]

where $\kappa(p, q)$ is the Kullback-Leibler divergence between the Bernoulli measures with success probabilities $p$ 
and $q$. Since the Kullback-Leibler divergence does not exceed the chi-square divergence we obtain $\kappa(p, q) \le (p - q)^2(p^{-1} + q^{-1}) = 4\epsilon^2/(1 - \epsilon^2) \le 16\epsilon^2/3$ for any $\epsilon < 1/2$. The lemma now follows by substitution of this 
bound on $\kappa(p, q)$ into (59). □ 
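The Bernoulli bound used at the end of this proof is easy to check numerically. The sketch below evaluates the Kullback-Leibler divergence between Bernoulli measures with parameters $p = (1+\epsilon)/2$ and $q = (1-\epsilon)/2$, the chi-square expression $(p-q)^2(p^{-1}+q^{-1})$, its closed form $4\epsilon^2/(1-\epsilon^2)$, and the final bound $16\epsilon^2/3$. The function names and the grid of values of $\epsilon$ are chosen here for illustration only.

```python
import math

def bernoulli_kl(p, q):
    """Kullback-Leibler divergence between Bernoulli(p) and Bernoulli(q)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def bounds_for(eps):
    """Return (KL, chi-square expression, closed form 4 eps^2/(1-eps^2), final bound 16 eps^2/3)."""
    p, q = (1 + eps) / 2, (1 - eps) / 2
    kl = bernoulli_kl(p, q)
    chi2 = (p - q) ** 2 * (1 / p + 1 / q)
    closed_form = 4 * eps**2 / (1 - eps**2)
    return kl, chi2, closed_form, 16 * eps**2 / 3

rows = [bounds_for(eps) for eps in (0.05, 0.1, 0.25, 0.49)]
for row in rows:
    print(row)
```

Each printed row should be nondecreasing from left to right, with the second and third entries equal up to rounding.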


4.9.3 Proof of (45) 

We use here a reduction to the problem of testing two simple hypotheses by Le Cam's method. Fix some 
$0 < \epsilon \le 1/4$. Let $W_1$ be the constant graphon with $W_1(x, y) = 1/2$, and let $W_2 \in \mathcal{W}[2]$ be the 2-step graphon 
with $W_2(x, y) = 1/2 + \epsilon$ if $(x, y) \in [0,1/2)^2 \cup [1/2,1]^2$ and $W_2(x, y) = 1/2 - \epsilon$ elsewhere. Obviously, we have 

\[ \delta^2(\rho_n W_1, \rho_n W_2) = \rho_n^2\epsilon^2 . \tag{60} \]


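We have 

\[
\begin{aligned}
\inf_{\hat f} \max_{W_0 \in \{W_1, W_2\}} E_{W_0}\big[\delta^2(\hat f, \rho_n W_0)\big]
&\ge \frac{1}{2} \inf_{\hat f} \int \big(\delta^2(\hat f, \rho_n W_1) + \delta^2(\hat f, \rho_n W_2)\big) \min(dP_{W_1}, dP_{W_2}) \\
&\ge \frac{1}{4}\, \delta^2(\rho_n W_1, \rho_n W_2) \int \min(dP_{W_1}, dP_{W_2}) \\
&\ge \frac{\rho_n^2\epsilon^2}{8} \exp\big(-\chi^2(P_{W_2}, P_{W_1})\big) ,
\end{aligned}
\tag{61}
\]

where $\chi^2(P_{W_2}, P_{W_1})$ is the chi-square divergence between $P_{W_2}$ and $P_{W_1}$. In the last inequality we have used 
(2.24) and (2.26) from [15], and (60). Finally, the following lemma allows us to conclude the proof by setting 

\[ \epsilon = \sqrt{\frac{c_0}{n\rho_n}} \wedge \frac{1}{4} . \]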


Lemma 4.9. There exists an absolute constant $c_0 > 0$ such that $\chi^2(P_{W_2}, P_{W_1}) \le 1/4$ if $\epsilon$ satisfies $n\rho_n\epsilon^2 \le c_0$. 

Proof of Lemma 4.9. Let $L(A')$ be the likelihood ratio of $P_{W_2}$ with respect to $P_{W_1}$. Since $\chi^2(P_{W_2}, P_{W_1}) = E_{W_1}[L(A')^2] - 1$, it remains to prove that $E_{W_1}[L(A')^2] \le 5/4$. 

For the sake of brevity, we write in what follows $E[\cdot] = E_{W_1}[\cdot]$. We also set $p_0 := \rho_n/2$, $p_1 := \rho_n(1/2 + \epsilon)$ 
and $p_2 := \rho_n(1/2 - \epsilon)$. 

As the graphon $\rho_n W_2$ is a 2-step function, we may assume that the components of $\xi = (\xi_1, \ldots, \xi_n)$ are 
i.i.d. uniformly distributed on $\{0,1\}$. Given $\xi$, define the collection $S := \{\{a, b\} : \xi_a = \xi_b\}$ of subsets of 
indices with identical position. For $\{i,j\}$ in $S$ (resp. $S^c$), $A_{ij}$ follows a Bernoulli distribution with parameter 
$p_1$ (resp. $p_2$). Denote by $\mu$ the distribution of $S$ and by $n^{(2)} = n(n-1)/2$ the number of subsets of size 2 of $[n]$. 
Then, the likelihood has the form 

\[
\begin{aligned}
L(A') &= \int L_S(A')\, d\mu(S) , \\
L_S(A') &:= \Big(\frac{1-p_1}{1-p_0}\Big)^{|S|} \Big(\frac{1-p_2}{1-p_0}\Big)^{n^{(2)}-|S|} \prod_{\{a,b\} \in S} \Big(\frac{p_1(1-p_0)}{p_0(1-p_1)}\Big)^{A_{ab}} \prod_{\{a,b\} \in S^c} \Big(\frac{p_2(1-p_0)}{p_0(1-p_2)}\Big)^{A_{ab}} .
\end{aligned}
\]

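By Fubini's theorem, we may write $E[L^2(A')] = \int E[L_{S_1}(A')L_{S_2}(A')]\, d\mu(S_1)\,d\mu(S_2)$, where 

\[
\begin{aligned}
E[L_{S_1}(A')L_{S_2}(A')]
&= \Big(\frac{1-p_1}{1-p_0}\Big)^{|S_1|+|S_2|} \Big(\frac{1-p_2}{1-p_0}\Big)^{2n^{(2)}-|S_1|-|S_2|} \\
&\quad \times E\Bigg[ \prod_{\{a,b\} \in S_1 \cap S_2} \Big(\frac{p_1(1-p_0)}{p_0(1-p_1)}\Big)^{2A_{ab}} \prod_{\{a,b\} \in S_1^c \cap S_2^c} \Big(\frac{p_2(1-p_0)}{p_0(1-p_2)}\Big)^{2A_{ab}} \prod_{\{a,b\} \in S_1 \Delta S_2} \Big(\frac{p_1p_2(1-p_0)^2}{p_0^2(1-p_1)(1-p_2)}\Big)^{A_{ab}} \Bigg] \\
&= \Big[1 + \frac{(p_1-p_0)^2}{p_0(1-p_0)}\Big]^{|S_1 \cap S_2|} \Big[1 + \frac{(p_2-p_0)^2}{p_0(1-p_0)}\Big]^{|S_1^c \cap S_2^c|} \Big[1 + \frac{p_1p_2 + p_0^2 - p_1p_0 - p_2p_0}{p_0(1-p_0)}\Big]^{|S_1 \Delta S_2|} .
\end{aligned}
\]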

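Using the definition of $p_0$, $p_1$ and $p_2$, we find 

\[
\begin{aligned}
E[L_{S_1}(A')L_{S_2}(A')]
&= \Big[1 + \frac{\rho_n^2\epsilon^2}{p_0(1-p_0)}\Big]^{|S_1 \cap S_2| + |S_1^c \cap S_2^c|} \Big[1 - \frac{\rho_n^2\epsilon^2}{p_0(1-p_0)}\Big]^{|S_1 \Delta S_2|} \\
&\le \Big[1 + \frac{\rho_n^2\epsilon^2}{p_0(1-p_0)}\Big]^{|S_1 \cap S_2| + |S_1^c \cap S_2^c| - |S_1 \Delta S_2|} \\
&\le \exp\Big[\big(2|S_1 \cap S_2| + 2|S_1^c \cap S_2^c| - n^{(2)}\big)\, 4\rho_n\epsilon^2\Big] .
\end{aligned}
\]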


Thus, to bound the second moment of $L(A')$, it suffices to control an exponential moment of $T := |S_1 \cap S_2| + |S_1^c \cap S_2^c|$ where $S_1$ and $S_2$ are independent and distributed according to $\mu$. To handle this quantity, we denote 
by $\xi = (\xi_1, \ldots, \xi_n)$ and by $\xi' = (\xi'_1, \ldots, \xi'_n)$ the positions for the first and second sample corresponding to 
$S_1$ and $S_2$, respectively. Next, for any $(i, j) \in \{0,1\}^2$, define 

\[ N_{ij} := |\{a : \xi_a = i \text{ and } \xi'_a = j\}| . \]

Then we have $2|S_1 \cap S_2| + n = N_{00}^2 + N_{01}^2 + N_{10}^2 + N_{11}^2$ and $2|S_1^c \cap S_2^c| = 2N_{00}N_{11} + 2N_{01}N_{10}$, so that 
$2T + n = (N_{00} + N_{11})^2 + (N_{01} + N_{10})^2$. Define the random variable $Z := N_{00} + N_{11} - n/2$. It has a centered 
binomial distribution with parameters $n$ and $1/2$. We have 

\[ 2T - n^{(2)} = (n/2 + Z)^2 + (n/2 - Z)^2 - n - n^{(2)} = 2Z^2 - n/2 . \]
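The counting identities just stated can be verified by brute force on a small sample. In the Python sketch below, the two label vectors `xi` and `xi_prime` play the roles of $\xi$ and $\xi'$; the sample size, the random seed and the helper name `pair_counts` are assumptions made for illustration.

```python
import itertools
import random

def pair_counts(xi, xi_prime):
    """Return (|S1 ∩ S2|, |S1^c ∩ S2^c|) by enumerating all pairs {a, b}."""
    both_same = 0
    both_diff = 0
    for a, b in itertools.combinations(range(len(xi)), 2):
        same1 = xi[a] == xi[b]                # {a, b} belongs to S1
        same2 = xi_prime[a] == xi_prime[b]    # {a, b} belongs to S2
        both_same += same1 and same2
        both_diff += (not same1) and (not same2)
    return both_same, both_diff

rng = random.Random(2)
n = 12
xi = [rng.randint(0, 1) for _ in range(n)]
xi_prime = [rng.randint(0, 1) for _ in range(n)]

both_same, both_diff = pair_counts(xi, xi_prime)
T = both_same + both_diff
n2 = n * (n - 1) // 2  # number of pairs, n^(2)

# N[i, j] counts the indices a with xi[a] == i and xi_prime[a] == j.
N = {(i, j): sum(1 for a in range(n) if xi[a] == i and xi_prime[a] == j)
     for i in (0, 1) for j in (0, 1)}
Z = N[0, 0] + N[1, 1] - n / 2

lhs = 2 * T - n2
rhs = 2 * Z**2 - n / 2
print(lhs, rhs)
```

The two printed values should agree for any choice of the label vectors, which is what allows the proof to reduce the problem to an exponential moment of $Z^2$.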

Plugging this identity into the expression for $E[L^2(A')]$, we conclude that 

\[ E[L^2(A')] \le E\big[\exp(8\rho_n\epsilon^2 Z^2)\big] , \]

where $E[\cdot]$ stands for the expectation with respect to the distribution of $Z$. By Hoeffding's inequality, $Z$ 
has a subgaussian distribution with subgaussian norm smaller than $n^{1/2}$. Consequently, $E[L^2(A')] \le 5/4$ as 
soon as $8\rho_n n\epsilon^2$ is smaller than some numerical constant. □ 


4.10 Technical lemmas 

Lemma 4.10. Let $X_1, \ldots, X_n$ be i.i.d. uniformly distributed variables on $[0,1]$ and let $X_{(i)}$ be the $i$th element 
of the ordered sample $X_{(1)} \le X_{(2)} \le \cdots \le X_{(n)}$. Then, for any $n_0 \le n$, and $0 < s < n_0$, 

\[ E\big(X_{(i)} - X_{(i+s)}\big)^2 = \frac{s(s+1)}{(n+1)(n+2)} \le \Big(\frac{n_0}{n}\Big)^2 . \]

Proof. Note that $X_{(i+s)} - X_{(i)} \sim \mathrm{Beta}(s, n-s+1)$. For $Y \sim \mathrm{Beta}(\beta, \gamma)$ we have 

\[ E(Y^2) = \frac{\beta(\beta+1)}{(\beta+\gamma+1)(\beta+\gamma)} \]

which implies the lemma. 


□ 
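As a small worked check of this proof, the following Python sketch evaluates the second moment of a Beta variable in exact rational arithmetic and compares the resulting spacing moment $s(s+1)/((n+1)(n+2))$ with the bound $(n_0/n)^2$ over all admissible pairs $(s, n_0)$ for small $n$. The function names are hypothetical, and the check uses the strict condition $s < n_0$.

```python
from fractions import Fraction

def beta_second_moment(beta, gamma):
    """E(Y^2) for Y ~ Beta(beta, gamma), as an exact fraction."""
    return Fraction(beta * (beta + 1), (beta + gamma + 1) * (beta + gamma))

def spacing_second_moment(s, n):
    """Second moment of X_(i+s) - X_(i) ~ Beta(s, n - s + 1) for n uniform variables."""
    return beta_second_moment(s, n - s + 1)

def lemma_holds(n):
    """Check s(s+1)/((n+1)(n+2)) <= (n0/n)^2 for all 0 < s < n0 <= n."""
    return all(
        spacing_second_moment(s, n) <= Fraction(n0, n) ** 2
        for n0 in range(1, n + 1)
        for s in range(1, n0)
    )

print(spacing_second_moment(2, 5))
print(all(lemma_holds(n) for n in range(2, 30)))
```

Exact fractions avoid any rounding question in the comparison; a Monte Carlo simulation of ordered uniform samples would be an alternative way to illustrate the same moment formula.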


Proof of Lemma 4.1. Since HTUoo < 2 r, the definition of q m ax and the fact that ||T|| F > r imply eg < 


I F < e 9max . Denote by q the integer such that 2 l tq < ||T||f < e q. Let Vq be a matrix minimizing 


q,U ,z r 


||r/||T|| F - V|| F over all V in C ir . Then, take V = V 0 
over all U £ (—1,0, l} kxk . Notice that U is also a minimizer of HT— V 0 

~ * q,U,z<- 

since both T and V, 

Denoting by 0 the zero k x k matrix we have 


where U is a minimizer of IIT — V, 


q,U,z r 


q,U ,z r 


Hoc over all U £ (-1, 0,1} 


kxk 


0 are block-constant matrices with the same block structure determined by z r . 


\T-V\\ f < 


< 

\\f-vt’°’ Zr \\ F 

(since U is a minimizer 

= 

l|r-e 4 Vo||F 

(since ||T||oo < 2 r) 

= 

\\t~eqf/\\t\\ F \\ F 

+ ^||T/||T|| F -t/ 0 || F 

< 

(l|r|| F -e 4 )+c 9 ||2 

7||t|| F -V 0 ||F 

< 

(||f|| F -e 4 ) + ! 

(since C^ r is a 1/4-net) 

< 

imi F /4. 



q,U,z r 


Next, since V = V, 


q,U ,z r 


minimizes || T V 


q,U ,z r 
0 


| |oo over U we have 


|T- V||oc < \\T - Vq U * 


where U* is the matrix with elements defined by the relation U* b = sign(T,y) if i £ z r 1 (a), j £ z r 1 (b) for 

(a, b) £ [A:] x [k]. Thus, all the entries of Vq r in each block are either equal to r or to —r depending on 
whether the value of T on this block is positive or negative respectively. Since HTHoo < 2r we obtain that 


I T V 


q,U*,z 


< r. 


□ 